CN114038468B - Voice data comparison processing method and device, electronic equipment and storage medium

Voice data comparison processing method and device, electronic equipment and storage medium

Info

Publication number
CN114038468B
CN114038468B
Authority
CN
China
Prior art keywords
time period
voice
voice data
map
segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210012225.3A
Other languages
Chinese (zh)
Other versions
CN114038468A (en)
Inventor
张伟彬
丁俊豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Digital Miracle Technology Co ltd
Voiceai Technologies Co ltd
Original Assignee
Voiceai Technologies Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Voiceai Technologies Co ltd filed Critical Voiceai Technologies Co ltd
Priority to CN202210012225.3A priority Critical patent/CN114038468B/en
Publication of CN114038468A publication Critical patent/CN114038468A/en
Application granted granted Critical
Publication of CN114038468B publication Critical patent/CN114038468B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application relates to the technical field of voice processing and discloses a voice data comparison processing method and device, an electronic device, and a storage medium. The voice data comparison processing method includes: obtaining first voice data and second voice data to be compared; displaying, in a first interface, a first map representing the first voice data and a second map representing the second voice data; in response to a received segment marking instruction, marking a first voice segment of the first voice data and a second voice segment corresponding to the first voice segment in the second voice data; and displaying, in a second interface, a third map representing the first voice segment and a fourth map representing the second voice segment. In this way, the marked voice data can be compared conveniently and quickly, which improves the comparison efficiency.

Description

Voice data comparison processing method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a method and an apparatus for comparing and processing speech data, an electronic device, and a storage medium.
Background
Voiceprint identification, also known as voice identity identification, is one of the biometric techniques. During identification, when different pieces of voice data need to be compared, for example to confirm whether they come from the same speaker, the relevant segments in each piece of voice data must be copied separately; the operation is cumbersome and the comparison efficiency is low.
Disclosure of Invention
In view of the foregoing problems, the present application provides a method, an apparatus, an electronic device, and a storage medium for comparing and processing voice data.
In a first aspect, an embodiment of the present application provides a voice data comparison processing method, which includes: obtaining first voice data and second voice data to be compared; displaying, in a first interface, a first map representing the first voice data and a second map representing the second voice data; in response to a received segment marking instruction, marking a first voice segment of the first voice data and a second voice segment corresponding to the first voice segment in the second voice data; and displaying, in a second interface, a third map representing the first voice segment and a fourth map representing the second voice segment.
In a second aspect, an embodiment of the present application provides a device for comparing and processing voice data, where the device includes: the data acquisition module is used for acquiring first voice data and second voice data to be compared; the first display module is used for displaying a first map representing the first voice data and displaying a second map representing the second voice data in the first interface; the segment marking module is used for marking a first voice segment of the first voice data and a second voice segment corresponding to the first voice segment in the second voice data in response to the received segment marking instruction; and the second display module is used for displaying a third map representing the first voice fragment and a fourth map representing the second voice fragment in the second interface.
In a third aspect, an embodiment of the present application provides an electronic device including one or more processors, a memory, and one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors to perform the voice data comparison processing method provided in the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where a program code is stored in the computer-readable storage medium, and the program code can be called by a processor to execute the voice data comparison processing method provided in the first aspect.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on them without creative effort.
Fig. 1 is a schematic flowchart illustrating a voice data comparison processing method according to an embodiment of the present application;
Fig. 2 is a schematic diagram illustrating a first interface in a voice data comparison processing method according to an embodiment of the present application;
Fig. 3 is a schematic flowchart illustrating a voice data comparison processing method according to another embodiment of the present application;
Fig. 4 is a schematic diagram illustrating a second interface in a voice data comparison processing method according to another embodiment of the present application;
Fig. 5 is a schematic flowchart illustrating a voice data comparison processing method according to yet another embodiment of the present application;
Fig. 6 is a schematic diagram illustrating a third interface in a voice data comparison processing method according to yet another embodiment of the present application;
Fig. 7 is a schematic flowchart illustrating a voice data comparison processing method according to yet another embodiment of the present application;
Fig. 8 is a block diagram illustrating a voice data comparison processing apparatus according to an embodiment of the present application;
Fig. 9 is a block diagram illustrating an electronic device according to an embodiment of the present application;
Fig. 10 is a block diagram illustrating a computer-readable storage medium according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
Voiceprint identification, one of the biometric techniques, is also known as voice identity identification, speaker identification, or voice identification. It refers to making a scientific judgment on the identity of the speaker of voice data recorded in audio or audio-visual material through comparison and analysis. During identification, different pieces of voice data are often compared and analyzed; for example, their features (such as formant frequencies and trends) are compared one by one.
In the current common identification workflow, the voice data to be compared are copied into comparison software separately, the voice features are adjusted and compared one by one, and the comparison results are manually copied into other software for typesetting. This workflow requires manual operation by the user; if the compared features are not satisfactory, the user must repeatedly readjust and repeatedly copy and paste, which costs considerable time and effort, makes the operation cumbersome, and results in low comparison efficiency.
Therefore, in order to overcome the above drawbacks, the inventors of the present application propose a voice data comparison processing method and device, an electronic device, and a storage medium, which relate to the technical field of voice processing. The voice data comparison processing method includes: obtaining first voice data and second voice data to be compared; displaying, in a first interface, a first map representing the first voice data and a second map representing the second voice data; in response to a received segment marking instruction, marking a first voice segment of the first voice data and a second voice segment corresponding to the first voice segment in the second voice data; and displaying, in a second interface, a third map representing the first voice segment and a fourth map representing the second voice segment. In this way, the marked voice data can be compared conveniently and quickly, which improves the comparison efficiency.
Reference will now be made to specific embodiments.
Referring to fig. 1, fig. 1 shows a voice data comparison processing method provided in an embodiment of the present application, where the method can be applied to a terminal device. Specifically, the method may include steps S110 to S140.
Step S110, obtain the first voice data and the second voice data to be compared.
In some embodiments, the terminal device may acquire the first voice data and the second voice data in response to an operation by a user. In other embodiments, the terminal device may also receive the first voice data and the second voice data sent by another device for comparison.
In some embodiments, the terminal device may read the voice data from a locally stored file in response to a selection operation by the user. In some embodiments, the terminal device may also read the voice data from a local database or from a database located on another device. In some embodiments, the terminal device may also read the voice data from a server or another device through a network interface.
In some embodiments, the terminal device may be, for example, a notebook computer, a desktop computer, a tablet computer, a smart phone, and the like, and a specific type of the terminal device may be selected according to an actual need, which is not limited in this application.
In one embodiment, the storage type of the file may be block storage, file storage, or object storage. The storage format of the file may be, for example, the WAV (Waveform Audio File) format, the MP3 (Moving Picture Experts Group Audio Layer III) format, the FLAC (Free Lossless Audio Codec) format, the APE (Monkey's Audio) format, and the like.
In one embodiment, the database may be a relational database, such as MySQL, SQL Server. The database may also be a non-relational database, such as MongoDB, Redis, Memcache.
In one embodiment, a server may refer to an individual server or a server cluster; the server can be a local server or a cloud server. The terminal equipment can be a smart phone, a notebook computer, a desktop computer, a tablet computer and the like.
Further, the network used when reading voice data from a server or another terminal device through a network interface is typically the Internet, but may be any network, including but not limited to a local area network (LAN), a metropolitan area network (MAN), a wide area network (WAN), a mobile network, a wired or wireless network, a private network, or any combination of virtual private networks. In addition, data read from a server or other terminal devices through a network interface may be transmitted using a specific communication protocol, including but not limited to the BLE (Bluetooth Low Energy) protocol, the WLAN (Wireless Local Area Network) protocol, the Bluetooth protocol, the ZigBee protocol, or the Wi-Fi (Wireless Fidelity) protocol.
In some embodiments, the terminal device may receive voice data sent from another device for comparison, for example, after the terminal device is connected to a mobile phone, a user may select voice data to be compared on the mobile phone, and send the voice data to the terminal device through the mobile phone, so that the terminal device receives the voice data transmitted from the mobile phone for voice comparison.
In some embodiments, the connection between the terminal device and the other device may be a wired connection or a wireless connection. The wired connection medium may include, but is not limited to, an optical fiber, a coaxial cable, a twisted pair, and the like, and the wireless connection may include, but is not limited to, ZigBee, Wi-Fi (Wireless Fidelity), Bluetooth, laser, infrared, and the like.
In some embodiments, the number of the voice data to be compared may be multiple, that is, more than two voice data may be obtained for comparison, for example, the first voice data, the second voice data and the third voice data to be compared may be obtained. The embodiment of the present application takes two voice data to be compared as an example for explanation.
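As an illustrative sketch only of the acquisition step, reading two locally stored audio files might look as follows in Python; the use of the soundfile library and the file names are assumptions, since the embodiment does not prescribe a particular implementation.

```python
# Illustrative sketch only: reading two locally stored audio files for comparison.
# The soundfile library and the file names are assumptions, not part of the embodiment.
import soundfile as sf

def acquire_voice_data(first_path: str, second_path: str):
    """Return (samples, sample_rate) for the first and second voice data."""
    first_samples, first_sr = sf.read(first_path)    # e.g. a WAV or FLAC file
    second_samples, second_sr = sf.read(second_path)
    return (first_samples, first_sr), (second_samples, second_sr)

if __name__ == "__main__":
    (x1, sr1), (x2, sr2) = acquire_voice_data("first_voice.wav", "second_voice.wav")
    print(f"first: {len(x1) / sr1:.2f} s, second: {len(x2) / sr2:.2f} s")
```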
Step S120, displaying a first map representing the first voice data and a second map representing the second voice data in the first interface.
In an embodiment of the present application, the first map and the second map may be of various types that represent characteristics of the voice data, for example a spectrogram or a time domain graph; the type of the displayed map may be set according to actual use requirements, which is not limited in this application.
A spectrogram is a spectrum analysis view that expresses three-dimensional information on a two-dimensional plane: the energy value can be represented by color, and the darker the color, the stronger the voice energy at that point. The displayed spectrogram may be a narrow-band spectrogram or a wide-band spectrogram. A narrow-band spectrogram clearly displays the harmonic structure and reflects how the fundamental frequency varies over time; a wide-band spectrogram clearly displays the formant structure and reflects the fast time variation of the spectrum. In practical applications, either the narrow-band or the wide-band spectrogram may be displayed according to different requirements, or both may be displayed at the same time.
A frequency spectrum diagram can represent the level of a piece of audio data at each frequency at a certain moment, where the horizontal axis represents frequency and the vertical axis represents amplitude; common forms include the amplitude spectrum and the phase spectrum.
A time domain graph describes the relationship between a mathematical function or physical signal and time; the time domain waveform of a signal shows how the signal changes over time.
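As an illustrative sketch of the narrow-band/wide-band distinction described above, the two spectrogram types differ mainly in the analysis window length; the 5 ms and 30 ms window durations below are assumed typical values and are not fixed by the embodiment.

```python
# Sketch: computing a wide-band (short window) or narrow-band (long window)
# spectrogram; the 5 ms / 30 ms window durations are assumed typical values.
import numpy as np
from scipy.signal import spectrogram

def speech_spectrogram(samples: np.ndarray, sample_rate: int, wideband: bool = True):
    window_ms = 5 if wideband else 30          # short window -> formants, long window -> harmonics
    nperseg = max(int(sample_rate * window_ms / 1000), 16)
    freqs, times, power = spectrogram(samples, fs=sample_rate,
                                      nperseg=nperseg, noverlap=nperseg // 2)
    return freqs, times, 10 * np.log10(power + 1e-12)   # dB values for display
```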
Specifically, the terminal device may display a first map corresponding to the acquired first voice data and a second map corresponding to the second voice data on the first interface, so that the comparison by the user is facilitated.
In the embodiment of the application, the display positions of the first map and the second map in the first interface may be set according to actual use requirements, which is not limited in this application.
In some embodiments, the first map and the second map may be located in the upper and lower portions of the first interface, respectively. For example, the first map may be displayed in the upper half of the first interface and the second map in the lower half; alternatively, the second map may be displayed in the upper half and the first map in the lower half. Further, the first map and the second map may be aligned to facilitate comparison; for example, their longitudinal axes may be located on the same straight line.
In other embodiments, the first map and the second map may be located in the left and right portions of the first interface, respectively. For example, the first map may be displayed in the right half of the first interface and the second map in the left half; alternatively, the second map may be displayed in the right half and the first map in the left half.
In some embodiments, the first interface may include an image display area, in which the first map and the second map are displayed, and a functional area. The functional area may be used to display a function menu or the like.
In some embodiments, the user may adjust the first map and the second map in the first interface, for example by zooming in, zooming out, or marking, so as to facilitate voice comparison.
In some embodiments, the user may also directly adjust the positions of the first map and the second map in the first interface. For example, the first map may be selected and dragged to a suitable location. As another example, the positions of the first map and the second map may be swapped. Suppose the display area of the first map in the first interface is area A and the display area of the second map is area B; when the first map is selected and dragged to the area where the second map is displayed, the original display areas of the two maps are automatically exchanged, i.e. the display area of the first map becomes area B and the display area of the second map becomes area A.
In some embodiments, when the first voice data and the second voice data are too long for the corresponding first map and second map to be completely displayed in their respective display areas of the first interface, a slide bar may be provided for each map, and the user may manually drag the slide bar to change the display range of the first map and the second map. Illustratively, if the total duration of the first map is 1 min and its display area can only show 40 ms at a time, i.e. the display range is 40 ms, dragging the slide bar can change the displayed portion of the first map from the interval 0 ms to 40 ms to the interval 10 ms to 50 ms; that is, the portion of the first map from 0 ms to 40 ms displayed in the first interface is replaced by the portion corresponding to 10 ms to 50 ms. The specific position of the slide bar can be set according to actual needs and is not limited here.
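The slide-bar behaviour described above can be sketched as simple window arithmetic; the millisecond units and the clamping rule below are assumptions for illustration.

```python
# Sketch: shifting the visible display range of a map when the slide bar is dragged.
def shift_display_range(start_ms: float, end_ms: float, offset_ms: float, total_ms: float):
    """Move a [start_ms, end_ms] viewport by offset_ms, clamped to [0, total_ms]."""
    span = end_ms - start_ms
    new_start = min(max(start_ms + offset_ms, 0.0), total_ms - span)
    return new_start, new_start + span

# e.g. a 40 ms viewport over a 1 min map, dragged forward by 10 ms
print(shift_display_range(0, 40, 10, 60_000))   # -> (10.0, 50.0)
```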
Step S130, marking a first voice segment of the first voice data and a second voice segment corresponding to the first voice segment in the second voice data in response to the received segment marking instruction.
The segment marking instruction is used to delimit, on the first map and the second map corresponding to the first voice data and the second voice data, the map segments that need to be compared.
In some embodiments, a user may set a mark point on a map corresponding to the voice data, and determine a voice segment to be compared through the mark point. For example, the user may determine a time point on the map as the first marked point, for example, the user may determine the first marked point by clicking; and determining another time point as a second marking point according to the required segment range, for example, the user can click again to determine another marking point. And determining the marking time length according to the first marking point and the second marking point, wherein the map corresponding to the marking time length is the voice segment marked according to the segment marking instruction. For example, a position with time of 10ms may be determined as a first mark point in a first map corresponding to the first voice data, and then a position of 20ms may be determined as a second mark point, where the map corresponding to the interval of 10ms to 20ms is a first voice segment corresponding to the first voice data marked according to the segment marking instruction.
In some embodiments, for the segment marking instruction, a time point may be determined on the map corresponding to the voice data as a first mark point, for example by a click; the first mark point is one end of the voice segment to be marked. The first mark point is then selected and dragged to the other end of the voice segment to be marked; for example, when the user stops dragging and releases, the last contact position is determined as the second mark point. The marking interval enclosed by the two ends is the voice segment obtained according to the segment marking instruction. Illustratively, the position at 10 ms may be determined as the first mark point in the first map corresponding to the first voice data and dragged backwards to the position at 20 ms; the marking interval from 10 ms to 20 ms is then the first voice segment marked in the first voice data according to the segment marking instruction.
In some embodiments, marked marker points may be deleted as needed. For example, the first marker point is already determined at the 10ms position, but the first marker point is changed to the 12ms position, so the first marker point at the original 10ms position can be deleted, and the 12ms position is set as the first marker point.
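A minimal sketch of turning two mark points into a marked interval is given below; the millisecond units and the function name are assumptions, and deleting or re-setting a mark point simply replaces the corresponding value.

```python
# Sketch: deriving a marked voice segment from two user-set mark points (in ms).
def mark_segment(first_point_ms: float, second_point_ms: float):
    """Return (start, end) of the marked interval regardless of marking order."""
    start, end = sorted((first_point_ms, second_point_ms))
    return start, end

print(mark_segment(10, 20))   # -> (10, 20), the first voice segment in the example above
```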
After the segment marking instruction is received, a first map segment to be marked can be selected in the first map corresponding to the first voice data to obtain the first voice segment to be marked in the first voice data, and a second voice segment corresponding to the first voice segment is then marked in the second map corresponding to the second voice data.
In some embodiments, the second speech segment may be determined by referring to a labeling method of the first speech segment, that is, a user may directly set a labeling point on a second map corresponding to the second speech data to determine a second speech segment corresponding to the first speech segment in the second speech data.
In some embodiments, the second speech segment may also be determined from the first speech segment. Alternatively, the second voice segment may be a segment similar to the first voice segment in the second voice data, for example, a segment with a similarity greater than a preset threshold with the first voice segment. Referring to fig. 2, for example, a first map 110 and a second map 120 are displayed in a first interface 100, a first speech segment a is determined in the first map 110, and a segment most similar to the first speech segment a is found in the second map 120 according to a map corresponding to the first speech segment a, and is labeled to obtain a second speech segment B.
In some embodiments, the second voice segment corresponding to the first voice segment may also be a segment whose pronunciation is the same as or similar to that of the first voice segment. For example, suppose the first voice data corresponds to the content "yes, I am Xiaoming" and the second voice data corresponds to the content "do not waste food". When the voice segment corresponding to the word "yes" is marked as the first voice segment in the first map corresponding to the first voice data, the segment with the highest similarity to the voice segment corresponding to "yes" is detected in the second voice data as the second voice segment. For example, if the voice segment corresponding to the word "food" in the second voice data (whose pronunciation in the original language is similar to that of "yes") is detected to have the highest similarity, it is marked as the second voice segment.
Further, when it is detected that the second voice data contains a plurality of segments whose content is the same as or similar to that of the first voice segment, the user may select, from those segments in the second map, the segment whose map is closest to that of the first voice segment as the second voice segment. For example, the terminal device may display the detection result, for example by highlighting or marking the candidate segments in the second map, and the user selects one of them as the second voice segment.
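The detection of a candidate second voice segment can be sketched as a sliding-window search over map frames; the cosine-similarity score below is an assumed example measure, since the embodiment does not fix a particular similarity metric.

```python
# Sketch: find the window of the second map most similar to the marked first
# segment. Cosine similarity over spectrogram frames is an assumed example metric.
import numpy as np

def most_similar_window(query: np.ndarray, target: np.ndarray):
    """query/target: (n_freq, n_frames) magnitude spectrograms.
    Returns (start_frame, similarity) of the best-matching window in target."""
    width = query.shape[1]
    q = query.ravel()
    q = q / (np.linalg.norm(q) + 1e-12)
    best_start, best_score = 0, -1.0
    for start in range(target.shape[1] - width + 1):
        w = target[:, start:start + width].ravel()
        score = float(np.dot(q, w / (np.linalg.norm(w) + 1e-12)))
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score
```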
Step S140, displaying a third map representing the first voice segment and a fourth map representing the second voice segment in the second interface.
In some embodiments, the terminal device may switch the current first interface to the second interface. In other embodiments, the terminal device may also generate a floating window to display the second interface based on the current first interface, that is, display the first interface and the second interface at the same time, and the user may adjust the position and size of the floating window as needed.
In some embodiments, after determining the first voice segment and the second voice segment according to the segment marking instruction, the user may start a comparison operation, thereby triggering the terminal device to generate the second interface. In some embodiments, the terminal device may display the first interface and the second interface at the same time; when the first voice segment is determined in the first interface, the second interface may synchronously display the third map representing the first voice segment, and when the second voice segment is determined in the first interface, the second interface may synchronously display the fourth map representing the second voice segment, so that the user can compare them conveniently and quickly and the operation process is simplified.
In some embodiments, the comparison operation may be automatically started after the first voice segment and the second voice segment are determined, that is, the comparison operation is automatically started after the first voice segment and the second voice segment are detected, so as to compare the audio features of the first voice segment and the second voice segment.
Compared with the first map and the second map displayed in the first interface, which represent the complete first voice data and second voice data, the second interface displays the first voice segment of the first voice data and the second voice segment of the second voice data, so that the maps corresponding to the marked first voice segment and second voice segment can be seen more clearly and the audio features of the two voice segments can be compared more clearly.
In some embodiments, the display positions of the third map and the fourth map in the second interface may be set according to needs, and are not limited herein.
In some embodiments, in addition to displaying the third map and the fourth map, the second interface may display a function menu, a directory, and the like.
In some embodiments, the user may also control the voice data corresponding to the third map and the fourth map to be played in the second interface. It can be understood that the audios corresponding to the third map and the fourth map may be played individually or simultaneously.
In the embodiment of the application, the first voice data and the second voice data to be compared are acquired; a first map representing the first voice data and a second map representing the second voice data are displayed in the first interface; in response to the received segment marking instruction, a first voice segment of the first voice data and a second voice segment corresponding to the first voice segment in the second voice data are marked; and a third map representing the first voice segment and a fourth map representing the second voice segment are displayed in the second interface. In this way, the marked voice data can be compared conveniently and quickly, and the comparison efficiency is improved.
Referring to fig. 3, fig. 3 shows another voice data comparison processing method according to an embodiment of the present application, in which a segment marking instruction includes a first marking instruction for first voice data and a second marking instruction for second voice data, and specifically, the method may include steps S210 to S250.
Step S210, obtain the first voice data and the second voice data to be compared.
Step S220, displaying a first map representing the first voice data and a second map representing the second voice data in the first interface.
In the embodiment of the present application, the content in the foregoing embodiment can be referred to for the specific description of step S210 to step S220, and is not repeated herein.
Step S230, determining a first target time period according to a first time period corresponding to the first marking instruction, and determining a second target time period according to a second time period corresponding to the second marking instruction.
The terminal device can generate a first marking instruction in response to the marking operation of the user on the first voice data, and can determine a first time period marked by the user on the first voice data according to the first marking instruction. The first time period is a time range of a voice segment selected by the user in the first voice data.
The first target time period is a time range of the voice segments needing to be displayed in the first voice data determined according to the first time period.
Similarly, the terminal device may generate a second marking instruction in response to the marking operation of the second voice data by the user, and may determine a second time period marked by the second voice data by the user according to the second marking instruction. The second time period is a time range of a voice segment selected by the user in the second voice data.
The second target time period is a time range of the voice segments needing to be displayed in the second voice data determined according to the second time period.
In order to display the voice segments selected by the user, the first target time period includes the first time period and the second target time period includes the second time period. In order to ensure the voice comparison effect, the first target time period is longer than the first time period and the second target time period is longer than the second time period, so that omissions can be avoided and errors reduced.
The user's marking operation on the voice data may, due to operation errors, cause the selected voice segment to be incomplete, which affects the comparison effect. In order to reduce such operation errors, in some embodiments a first preset time period may be determined according to the time range of the first time period, where the first preset time period includes a first pre-preset time period before the first time period and/or a first post-preset time period after the first time period, and the first time period and the first preset time period are together used as the first target time period. In this way the first target time period covers a larger range of the voice data than the first time period, so that display omissions that would affect the comparison effect are avoided.
Similarly, a second preset time period may be determined according to the time range of the second time period, where the second preset time period includes a second pre-preset time period before the second time period and/or a second post-preset time period after the second time period, and the second time period and the second preset time period are together used as the second target time period.
In some embodiments, the first preset time period may be determined according to a preset proportion of the time range of the first time period. For example, the first previous predetermined time period may be determined according to a first predetermined proportion of the time range of the first time period. The first post-preset time period may be determined according to a second preset proportion of the time range of the first time period. Similarly, the second preset time period may be determined according to a preset proportion of the time range of the second time period. For example, the second pre-set time period may be determined according to a third pre-set proportion of the time range of the second time period. The second later preset time period may be determined according to a fourth preset proportion of the time range of the second time period.
The specific numerical values of the first preset proportion, the second preset proportion, the third preset proportion and the fourth preset proportion can be selected according to actual needs, and the application is not limited to this.
Optionally, the first preset time period may include a first previous preset time period. Optionally, the first preset time period may also include a first later preset time period. Optionally, the first preset time period may also include a first previous preset time period and a first next preset time period at the same time. The selection can be specifically performed according to actual needs, and the application is not limited to this.
Optionally, the second preset time period may include a second previous preset time period. Optionally, the second preset time period may also include a second later preset time period. Optionally, the second preset time period may also include a second front preset time period and a second rear preset time period at the same time. The selection can be specifically performed according to actual needs, and the application is not limited to this.
Illustratively, suppose the first time period runs from 00:00:20 to 00:00:30, so that its time range is 10 s, the first target time period includes a first pre-preset time period, and the first preset proportion is 0.1; then the time range of the first pre-preset time period is 1 s, and the first pre-preset time period runs from 00:00:19 to 00:00:20. The first target time period obtained from the first time period and the first preset time period is therefore 00:00:19 to 00:00:30.
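Interpreting the example timestamps in seconds, the derivation of the first target time period can be sketched as follows; the function name and the 0.1 proportion mirror the example above and are otherwise assumptions.

```python
# Sketch: extend a marked [start, end] time period (in seconds) by proportional
# pre/post preset periods to obtain the target time period.
def target_time_period(start_s: float, end_s: float,
                       pre_ratio: float = 0.1, post_ratio: float = 0.0):
    span = end_s - start_s
    return start_s - pre_ratio * span, end_s + post_ratio * span

# Marked 00:00:20 to 00:00:30 (span 10 s) with a 0.1 leading proportion
print(target_time_period(20.0, 30.0))   # -> (19.0, 30.0), i.e. 00:00:19 to 00:00:30
```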
In some embodiments, before the first preset time period is determined according to the time range of the first time period, the voice segment corresponding to the first time period may first be recognized, and whether that voice segment is complete is determined according to the recognition result, so as to decide whether the first preset time period needs to be set. Likewise, before the second preset time period is determined according to the time range of the second time period, the voice segment corresponding to the second time period may be recognized, and whether that voice segment is complete is determined according to the recognition result, so as to decide whether the second preset time period needs to be set.
In some embodiments, the terminal device may perform speech recognition on the speech segment corresponding to the first time period, and determine whether the speech segment corresponding to the first time period is complete according to a recognition result. The terminal device can perform voice recognition on the voice segment corresponding to the second time period, and judge whether the voice segment corresponding to the second time period is complete according to the recognition result.
In some cases, if the speech segment is complete, a recognition result can be obtained by speech recognition; for example, for the speech segment "me", the recognition result "me" can be obtained. If the speech segment is incomplete, speech recognition may fail; for example, if the segment contains only part of the audio of "me", no recognition result may be obtained because part of the segment is missing. An incomplete speech segment may affect the voice comparison effect, so a range larger than the selected time range can be further obtained for display, for example a first pre-preset time period before the first time period and/or a first post-preset time period after the first time period, and a second pre-preset time period before the second time period and/or a second post-preset time period after the second time period, thereby increasing the probability of successful speech recognition.
In some cases the time range of the selected speech segment is longer, for example the segment contains the speech of at least two characters. In this case only part of the audio in the segment may be incomplete; for example, in the speech segment for "us", the portion corresponding to the first character ("me") is complete while the portion corresponding to the second character is missing. Speech recognition can then still recognize at least part of the segment, for example "me". However, the recognition result does not match the time range of the segment: the speech segment corresponding to "us" spans 2 s, while the recognition result contains only "me", i.e. the result for a single character. Assuming an average duration of 1 s per character, a 2 s speech segment should correspond to a two-character recognition result. If the duration of the speech segment does not match the content of the recognition result, it can be judged that part of the speech segment is missing; since an incomplete segment may affect the comparison effect, a range larger than the selected time range can be further obtained for display.
In some embodiments, after the voice segment corresponding to the first time period is recognized, the time range of the first time period can further be acquired, and whether the first preset time period needs to be set is determined according to the recognized content and the time range of the first time period; likewise, after the voice segment corresponding to the second time period is recognized, the time range of the second time period is acquired, and whether the second preset time period needs to be set is determined according to the recognized content and the time range of the second time period.
Specifically, after the voice segment corresponding to the first time period is identified, the terminal device obtains a time range of the first time period, determines an original time range corresponding to the identified content according to the identified content and the average duration of each character, compares the original time range with the time range of the first time period, and determines that the first preset time period needs to be set if the original time range is smaller than the time range of the first time period. The method for determining whether the second preset time period needs to be set is similar to the first preset time period, and is not described herein again.
Illustratively, suppose the average duration of each character is 1 s and the content of the speech segment corresponding to the first time period is recognized. If the recognition result is "you", i.e. the original time range corresponding to the recognition result is 1 s, while the time range corresponding to the first time period is 2 s, an incomplete speech segment exists in the first time period, so a first pre-preset time period before the first time period and/or a first post-preset time period after the first time period can be obtained, which increases the probability of complete speech recognition.
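The completeness check described above can be sketched as comparing the duration implied by the recognition result with the marked duration; the 1 s average character duration is the example value used above, and the function name is an assumption.

```python
# Sketch: decide whether a preset time period should be added because the
# recognized content is shorter than the marked duration implies.
def needs_preset_period(num_recognized_chars: int, marked_duration_s: float,
                        avg_char_duration_s: float = 1.0) -> bool:
    implied_duration = num_recognized_chars * avg_char_duration_s
    return implied_duration < marked_duration_s   # part of the speech is missing

# One recognized character ("you") but a 2 s marked segment -> extend the segment
print(needs_preset_period(1, 2.0))   # -> True
```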
When voice data are compared, a voice segment with complete meaning is more conducive to improving comparison efficiency and accuracy. However, the voice segment determined from the user's marking operation is only part of the first voice data, and this part may be semantically incomplete, which affects the voice comparison effect.
For this reason, in some embodiments, the speech segments corresponding to the first time period may also be recognized, and segments similar to the context of the speech segments corresponding to the first time period are determined according to the recognition result, so as to determine a first preset time period according to the segments similar to the context, and the first time period and the first preset time period are taken as a first target time period; and recognizing the voice segment corresponding to the second time period, determining a second preset time period according to the recognition result, and taking the second time period and the second preset time period as a second target time period.
Specifically, the terminal device performs speech recognition on the specific content of the voice segment corresponding to the first time period, performs semantic judgment on the obtained recognition result, determines the first preset time period according to the semantics of that voice segment, and then takes the first time period and the first preset time period as the first target time period; similarly, the terminal device performs speech recognition on the specific content of the voice segment corresponding to the second time period, performs semantic judgment on the obtained recognition result, determines the second preset time period according to the semantics of that voice segment, and then takes the second time period and the second preset time period as the second target time period.
The first preset time period comprises a first front preset time period before the first time period and/or a first rear preset time period after the first time period; the second preset time period includes a second pre-preset time period before the second time period and/or a second post-preset time period after the second time period. Specifically, the first pre/post preset time period is a time period meeting the requirement of the context similarity determined according to the context of the speech recognition result of the first time period, and the second pre/post preset time period is a time period meeting the requirement of the context similarity determined according to the context of the speech recognition result of the second time period.
In some embodiments, the context correspondence table may be preset, and the context correspondence content may be stored in the context correspondence table in advance. Such as "eating", "apple", "milky tea", etc.
For example, the speech segment corresponding to the first time period is recognized and the recognition result is determined, and the corresponding context content can then be looked up in the context correspondence table. For instance, if the recognition result is "meal", the context content "eat a meal" can be found, and the first pre-preset time period before the first time period is then used as the first preset time period.
For example, the speech segment corresponding to the second time period may also be recognized and the recognition result determined, and the corresponding context content looked up in the context correspondence table. For instance, if the recognition result is "apple", the corresponding context content can be found; the speech segment in the second post-preset time period after the second time period is then obtained and recognized, and if its recognition result is "fruit", the second post-preset time period is used as the second preset time period.
In some embodiments, a neural network model for identifying context similarity may also be trained in advance for determining the context similarity of the speech segments.
Optionally, a speech segment corresponding to the first pre-preset time period and/or the first post-preset time period may be acquired, the acquired speech segment is identified, and the time period corresponding to the speech segment meeting the requirement of the context similarity is taken as the first preset time period. Similarly, the voice segments corresponding to the second front preset time period and/or the second rear preset time period may be acquired, the acquired voice segments are identified, and the time period corresponding to the voice segment meeting the requirement of the context similarity is taken as the second preset time period.
For example, the speech segment corresponding to the first time period, the speech segment corresponding to the first pre-preset time period before the first time period, and the speech segment corresponding to the first post-preset time period after the first time period may be processed by a pre-trained neural network model to obtain a first context similarity between the speech segment of the first time period and the speech segment of the first pre-preset time period, and a second context similarity between the speech segment of the first time period and the speech segment of the first post-preset time period. If the context similarity of a neighbouring speech segment is greater than a preset threshold, the time period corresponding to that segment is used as the first preset time period. For example, if the first context similarity is greater than the preset threshold, the first pre-preset time period is taken as the first preset time period; if the second context similarity is greater than the preset threshold, the first post-preset time period is taken as the first preset time period; and if both are greater than the preset threshold, the first pre-preset time period and the first post-preset time period are together taken as the first preset time period. The second preset time period is determined in the same way and is not described again here.
Optionally, the preset threshold of the context similarity may be set according to usage requirements. For example, the preset threshold may be set to 60%; the context similarity between the speech segment corresponding to the first time period and the speech segment in the first pre-preset time period is then determined, and if that similarity is 80%, i.e. greater than the 60% threshold, the first pre-preset time period is taken as the first preset time period.
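Given context-similarity scores from such a model, the selection of the first preset time period can be sketched as a simple threshold decision; the 60% threshold is the example value above, and the similarity scores are assumed to be supplied by a separately trained model.

```python
# Sketch: choose which neighbouring periods form the first preset time period
# from context-similarity scores produced by a separately trained model.
def choose_preset_periods(similarity_before: float, similarity_after: float,
                          threshold: float = 0.6):
    chosen = []
    if similarity_before > threshold:
        chosen.append("pre")     # first pre-preset time period qualifies
    if similarity_after > threshold:
        chosen.append("post")    # first post-preset time period qualifies
    return chosen

print(choose_preset_periods(0.8, 0.4))   # -> ['pre']
```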
In some embodiments, the time ranges of the first/second pre-preset time period and the first/second post-preset time period may refer to the description of the above embodiments, and are not repeated herein.
Step S240, selecting a voice segment corresponding to the first target time period in the first voice data as a first voice segment; and selecting a voice segment corresponding to the second target time period in the second voice data as a second voice segment.
The first voice segment is determined according to the first target time period in the first voice data. The second voice segment is a voice segment determined in the second voice data according to the second target time period.
In an embodiment of the present application, the terminal device may determine the first voice segment according to the first time period of the first marking instruction, where the first voice segment includes the voice segments corresponding to the first time period and the first preset time period. The terminal device may determine the second voice segment according to the second time period of the second marking instruction, where the second voice segment includes the voice segments corresponding to the second time period and the second preset time period.
In some embodiments, in order that the first voice segment and the second voice segment have the same length for comparison, the first time period and the second time period may be set to the same length, and the preset time periods before and after the first time period may be set to the same lengths as those before and after the second time period.
It can be understood that displaying the map corresponding to the preset time before and after the marked time period allows the marked voice segment to be analyzed together with its surrounding context; for example, when the corresponding voice data is played, the context before and after the marked voice segment can be better understood, which helps the user compare the first voice segment with the second voice segment.
Step S250, displaying a third map representing the first voice segment and a fourth map representing the second voice segment in the second interface.
Specifically, after the first voice segment and the second voice segment are determined, the terminal device displays a third map corresponding to the first voice segment and a fourth map corresponding to the second voice segment in the second interface.
In some embodiments, in the second interface, the third map may be displayed centered on the first time period and the fourth map may be displayed centered on the second time period.
Specifically, in the second interface, the third map is displayed in its region centered on the marked first time period, and the fourth map is displayed in its region centered on the marked second time period, which makes it convenient for the user to compare their features.
Further, as shown in fig. 4, in addition to the third map 210 and the fourth map 220 that are displayed centered around the first time period and centered around the second time period, the second interface 200 may also display LPC (Linear Predictive Coding) spectrums 230 of the first speech segment and the second speech segment in the third map and the fourth map, and further perform comparison analysis on the first speech segment and the second speech segment according to the LPC spectrums 230, for example, may obtain formant related parameters, such as a center frequency, a bandwidth, an intensity, a deviation ratio of LPC spectrums corresponding to the first speech segment and the second speech segment, and the like. Illustratively, as shown in fig. 4, the formant-related parameters 240 may be presented in the form of a graph or the like.
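As an illustrative sketch of obtaining formant-related parameters from an LPC analysis, the following uses librosa and numpy; the LPC order and the root-picking heuristic are assumptions and not values fixed by the embodiment.

```python
# Sketch: estimate LPC coefficients for a speech segment and derive rough
# formant centre frequencies from the LPC polynomial roots.
import numpy as np
import librosa

def lpc_formant_frequencies(segment: np.ndarray, sample_rate: int, order: int = 12):
    coeffs = librosa.lpc(segment.astype(np.float64), order=order)
    roots = [r for r in np.roots(coeffs) if np.imag(r) > 0]   # keep one root per conjugate pair
    freqs_hz = sorted(np.angle(r) * sample_rate / (2 * np.pi) for r in roots)
    return freqs_hz[:4]   # first few formant candidates (centre frequencies)
```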
In some embodiments, after the third map and the fourth map are displayed in the second interface, the first target time period and the second target time period may be re-determined according to the adjustment instruction received by the second interface, and the corresponding third map and fourth map may be refreshed.
Specifically, the first target time period is redetermined in response to a first adjusting instruction received by the second interface, and the third map is refreshed according to the redetermined first target time period; and re-determining the second target time period in response to a second adjustment instruction received by the second interface, and refreshing the fourth map according to the re-determined second target time period.
The first adjustment instruction acts on the second interface to re-determine the first target time period, so that the third map is refreshed in the second interface according to the re-determined first target time period; the second adjustment instruction acts on the second interface to re-determine the second target time period, so that the fourth map is refreshed in the second interface according to the re-determined second target time period. Because the marked time period can be re-determined and the corresponding map refreshed directly on the second interface according to an adjustment instruction, if the range of a compared segment needs to be adjusted during comparison, the user can adjust it on the second interface directly instead of returning to the first interface and obtaining a newly marked map through another segment marking instruction, which greatly reduces the time spent by the user, simplifies the operation flow, and improves the comparison efficiency.
Illustratively, suppose the first target time period is originally 00:00:10-00:00:20. After the second interface receives a first adjustment instruction indicating that the first target time period should be adjusted to 00:00:05-00:00:10, the map of 00:00:05-00:00:10 in the first voice data is obtained according to the content of the first adjustment instruction, the second interface is refreshed, and the adjusted map is displayed centered in the area where the original third map was displayed.
In some embodiments, when the second interface is refreshed according to the first adjustment instruction or the second adjustment instruction, instead of refreshing only the corresponding third map or fourth map, both maps may be refreshed each time. When the second interface displays other content in addition to the third map and the fourth map, that other content may also be refreshed at the same time, i.e. the whole interface is refreshed. The specific refresh area can be set as desired and is not limited herein.
In the embodiment of the present application, the first time period corresponding to the first marking instruction, together with a predetermined time before and after it, is selected as the first voice segment, and the second time period corresponding to the second marking instruction, together with a predetermined time before and after it, is selected as the second voice segment. Because the voice segment covers the marked time period plus a predetermined amount of surrounding time, the user hears the context before and after the marked portion when playing the audio on the second interface, which makes comparison easier. In addition, the marked time period can be re-determined and the corresponding map refreshed on the second interface according to an adjustment instruction, so that a user who needs to adjust the range of a compared segment does not have to return to the first interface and issue a new segment marking instruction; this greatly reduces the time spent by the user, simplifies the operation flow and improves the comparison efficiency.
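For illustration, selecting a target time period that adds a predetermined amount of context around the marked period could look like the sketch below; the 0.5 s default pads are assumed values, and the embodiments above also allow the pads to be chosen from a recognition result instead of being fixed.

def target_period(mark_start, mark_end, audio_duration, pad_before=0.5, pad_after=0.5):
    """Expand a marked period by a predetermined amount of leading and trailing
    context, clipped to the boundaries of the recording (times in seconds)."""
    return (max(0.0, mark_start - pad_before),
            min(audio_duration, mark_end + pad_after))

# The first and second voice segments are then the samples inside these ranges:
# t0, t1 = target_period(10.0, 12.0, audio_duration=60.0)
# first_segment = first_audio[int(t0 * sr):int(t1 * sr)]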
Referring to fig. 5, fig. 5 shows another voice data comparison processing method according to an embodiment of the present application. Specifically, the method may include steps S310 to S380.
Step S310, obtain the first voice data and the second voice data to be compared.
Step S320, displaying a first map representing the first voice data and a second map representing the second voice data in the first interface.
Step S330, marking a first voice segment of the first voice data and a second voice segment corresponding to the first voice segment in the second voice data in response to the received segment marking instruction.
Step S340, displaying a third map representing the first voice segment and a fourth map representing the second voice segment in the second interface.
In the embodiment of the present application, the content in the foregoing embodiment can be referred to for the specific description of step S310 to step S340, which is not described herein again.
Step S350, displaying at least one interface element for receiving map parameters input by a user in the third interface.
Specifically, the map parameters act on the third map or the fourth map in the second interface, and the third map or the fourth map is adjusted according to the specific parameters. An interface element is an element through which a user can input a map parameter. It will be appreciated that each interface element corresponds to one map parameter, i.e. the map parameters input through different interface elements selected by the user will also differ.
In some embodiments, the map parameters may include the windowing type, the window length, the number of FFT (Fast Fourier Transform) points, the LPC order, the maximum bandwidth, a formant limit condition, and the like.
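These parameters can be viewed as a single record that the third interface edits and that the second interface uses when redrawing the maps. The dataclass below is only an illustrative grouping; the field names and default values are assumptions of this example, not values given by the embodiment.

from dataclasses import dataclass

@dataclass
class MapParameters:
    """Parameters a user could enter through the third interface (defaults assumed)."""
    window_type: str = "hamming"        # windowing type
    window_length_ms: float = 50.0      # analysis window length
    n_fft: int = 512                    # number of FFT points
    lpc_order: int = 14                 # LPC order
    max_bandwidth_hz: float = 400.0     # maximum formant bandwidth to accept
    min_formant_hz: float = 90.0        # formant limit condition (lower bound)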
In some embodiments, a single interface element for enabling a user to input a map parameter may be displayed in the third interface, or two or more such interface elements may be displayed.
In some embodiments, the third interface is displayed in response to a corresponding trigger event in the second interface.
Alternatively, a control may be provided in the second interface that the user clicks to open the third interface directly.
Alternatively, the user may click the right mouse button and select the third interface for display from the dialog box that pops up.
Optionally, the manner of displaying the third interface may be set in advance, and the third interface is then displayed after the set operation is performed. For example, it may be set that the third interface is displayed after double-clicking the display area of a map; the third interface is then displayed after the user double-clicks the display area of the third map or the fourth map in the second interface.
Step S360, map parameters are acquired in response to the input content received by the interface element.
After the user selects an interface element, the corresponding map parameter is obtained according to the content input into that interface element. For example, after the user selects the window-length interface element and inputs a window length value of 40 ms, the window length of the resulting map is set to 40 ms according to this value.
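Input received through an interface element is typically normalized and range-checked before it becomes a map parameter. The sketch below shows one way to parse a window-length entry; the accepted range of 5-200 ms is an assumption made for illustration.

def parse_window_length_ms(raw: str, lo: float = 5.0, hi: float = 200.0) -> float:
    """Parse a window-length entry such as "40" or "40 ms" and clamp it to an assumed range."""
    value = float(raw.strip().rstrip("ms").rstrip())
    if not lo <= value <= hi:
        raise ValueError(f"window length {value} ms outside [{lo}, {hi}] ms")
    return value

print(parse_window_length_ms("40 ms"))  # 40.0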
In some embodiments, content may be input into an interface element by typing it directly, by selecting a desired value from given options, by adjusting a slider, or in other ways. The specific input manner can be chosen according to the characteristics of the content to be input, which is not limited in the present application. Illustratively, as shown in fig. 6, the interface elements 300 in the third interface are the window length and the number of FFT points, and the user may input the map parameter corresponding to each interface element.
Step S370, triggering a map updating instruction according to the map parameters.
Specifically, the map updating instruction is used for updating the maps according to the map parameters after the third interface has acquired them.
In some embodiments, the map updating instruction may be triggered directly, as soon as the third interface obtains the map parameters. For example, after the window length acquired by the third interface is adjusted from 50 ms to 40 ms, a map updating instruction carrying the 40 ms window length is triggered, and the window length of the maps is updated to 40 ms accordingly.
In some embodiments, the map updating instruction may instead be triggered by a confirmation click after the map parameters have been obtained from the third interface; the maps are then updated in response to the instruction.
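The two trigger styles described above (refresh as soon as a value changes, or refresh only after a confirmation click) can be sketched as follows; the class name, the on_update callback and the immediate flag are assumptions made for this example.

class MapParameterPanel:
    """Sketch of the two update-trigger styles; on_update would refresh the third and fourth maps."""

    def __init__(self, on_update, immediate=True):
        self.params = {}
        self.on_update = on_update
        self.immediate = immediate

    def set_param(self, name, value):
        self.params[name] = value
        if self.immediate:              # style 1: trigger the update as soon as a value changes
            self.on_update(dict(self.params))

    def confirm(self):
        if not self.immediate:          # style 2: trigger the update only when the user confirms
            self.on_update(dict(self.params))

panel = MapParameterPanel(on_update=lambda p: print("refresh maps with", p))
panel.set_param("window_length_ms", 40)  # triggers a refresh immediately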
Step S380, responding to the received map updating instruction, processing the voice data of the first voice segment and the second voice segment according to the map parameters corresponding to the map updating instruction, and correspondingly updating the third map and the fourth map.
In the embodiment of the application, after the second interface receives the map updating instruction, the voice data of the first voice segment and the second voice segment are processed according to the corresponding map parameters in the map updating instruction, and the third map and the fourth map are updated correspondingly.
Illustratively, suppose the original window length is 50 ms, the time range of the first voice segment and of the second voice segment is 50 ms, and the time range of the first time period and of the second time period is 30 ms, so that the time range of the first preset time period and of the second preset time period is 20 ms. If it is determined from the received map updating instruction that the original 50 ms window length needs to be adjusted to 40 ms, the time ranges of the first preset time period in the first voice segment and of the second preset time period in the second voice segment need to be adjusted to 10 ms accordingly.
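For spectrogram-like maps, one plausible realization of step S380 is simply to recompute the short-time spectrum of each segment with the new window length and FFT size. The sketch below uses scipy and assumes a 50 % window overlap, which the embodiment does not specify.

import numpy as np
import scipy.signal

def recompute_spectrogram(segment, sr, window_length_ms=40.0, n_fft=512, window_type="hamming"):
    """Recompute a spectrogram for one segment after the window length or FFT size changed."""
    nperseg = int(sr * window_length_ms / 1000.0)
    f, t, sxx = scipy.signal.spectrogram(
        segment, fs=sr, window=window_type, nperseg=nperseg,
        noverlap=nperseg // 2, nfft=max(n_fft, nperseg))
    return f, t, 10.0 * np.log10(sxx + 1e-12)

# Both displayed maps would be refreshed with the same parameters:
# f, t, s3 = recompute_spectrogram(first_segment, sr, window_length_ms=40.0)
# f, t, s4 = recompute_spectrogram(second_segment, sr, window_length_ms=40.0)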
In the embodiment of the application, at least one interface element for receiving map parameters input by a user is displayed in the third interface, the map parameters are acquired in response to the input content received by the interface elements, and a map updating instruction is then triggered according to those parameters. In response to the received map updating instruction, the second interface processes the voice data of the first voice segment and the second voice segment according to the corresponding map parameters and updates the third map and the fourth map accordingly. In this way, the first voice segment and the second voice segment displayed in the second interface remain associated with the original first voice data and second voice data, so that processing them in the second interface stays synchronized with the original voice data without returning to the first interface to process the original data again, which improves the comparison efficiency.
Referring to fig. 7, fig. 7 shows another voice data comparison processing method according to an embodiment of the present application. Specifically, the method comprises the following steps: s410 to S460.
Step S410, obtain the first voice data and the second voice data to be compared.
Step S420, displaying a first map representing the first voice data and a second map representing the second voice data in the first interface.
Step S430, marking a first voice segment of the first voice data and a second voice segment corresponding to the first voice segment in the second voice data in response to the received segment marking instruction.
Step S440, displaying a third map representing the first voice segment and a fourth map representing the second voice segment in the second interface.
In the embodiment of the present application, the content in the foregoing embodiment can be referred to for the specific description of step S410 to step S440, which is not described herein again.
Step S450, responding to the received screenshot instruction and generating a screenshot including at least the third map and the fourth map.
Specifically, the screenshot instruction is used for taking a screenshot of the content displayed in the second interface, and the screenshot includes at least the third map and the fourth map.
In some embodiments, the screenshot instruction may be a preset shortcut operation; for example, a screenshot may be taken by pressing the key combination Alt+D+F.
In some embodiments, the screenshot instruction may also be issued by a voice command of the user; for example, a screenshot is taken in response to the instruction when the user is detected to say "screenshot". The voice command can be captured by an audio acquisition device (such as a microphone or a microphone array).
In some embodiments, the screenshot containing the third map and the fourth map may be generated directly after the screenshot instruction is received.
In some embodiments, after the screenshot instruction is received, a screenshot selection interface may first be displayed, in which the user selects the area to be captured as needed.
Furthermore, a confirmation control may be provided after the screenshot area is selected, which is used to confirm that the selected area is correct before the screenshot is generated. Further, if the selected area is wrong, the user can reselect the area to be captured.
Step S460, copying the screenshot to a clipboard or saving the screenshot to a file.
In some embodiments, after generating the screenshot, the user may choose to copy the screenshot to a clipboard or save the screenshot to a file for subsequent viewing of the screenshot.
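A minimal sketch of generating and saving such a screenshot with matplotlib is shown below; writing the rendered image to a file corresponds to "saving the screenshot to a file", while the in-memory PNG is where a platform-specific clipboard copy would start. The function name, figure layout and colour map are assumptions of this example.

import io
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

def save_comparison_screenshot(spec3, spec4, path="comparison.png"):
    """Render the third and fourth maps side by side, save them to a file and return the PNG bytes."""
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    for ax, spec, title in zip(axes, (spec3, spec4), ("third map", "fourth map")):
        ax.imshow(spec, origin="lower", aspect="auto", cmap="magma")
        ax.set_title(title)
    fig.tight_layout()
    fig.savefig(path, dpi=150)         # "save to file"
    buf = io.BytesIO()
    fig.savefig(buf, format="png")     # in-memory PNG; a clipboard copy would start from these bytes
    plt.close(fig)
    return buf.getvalue()

# save_comparison_screenshot(s3, s4, "session1_40ms.png")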
In some embodiments, after the screenshot is copied to the clipboard, it can be pasted, for example into a document, or into other software such as WeChat or DingTalk to be sent to other users.
In the embodiment of the application, the second interface responds to the received screenshot instruction, generates a screenshot including at least the third map and the fourth map, and copies the screenshot to a clipboard or saves it to a file. In this way, the third map and the fourth map obtained under different conditions can be stored as screenshots, so that the user can later retrieve and compare them, which improves the comparison efficiency.
Referring to fig. 8, fig. 8 is a block diagram illustrating a voice data comparison processing apparatus 400 according to an embodiment of the present disclosure. The voice data comparison processing apparatus 400 includes a data obtaining module 410, a first display module 420, a segment marking module 430, and a second display module 440.
Specifically, the data obtaining module 410 is configured to obtain first voice data and second voice data to be compared.
The first display module 420 is configured to display a first map representing the first voice data and a second map representing the second voice data in the first interface.
The segment marking module 430 is configured to mark a first voice segment of the first voice data and a second voice segment corresponding to the first voice segment in the second voice data in response to the received segment marking instruction.
The second display module 440 is configured to display a third map representing the first voice segment and a fourth map representing the second voice segment in the second interface.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and modules may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, the coupling between the modules may be electrical, mechanical or other type of coupling.
In addition, the functional modules in the embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated module can be implemented in the form of hardware or in the form of a software functional module.
Referring to fig. 9, fig. 9 is a block diagram illustrating an electronic device 500 according to an embodiment of the present disclosure. The electronic device 500 may be a PC, a mobile terminal, or another electronic device capable of running an application. The electronic device 500 in the present application may include one or more of the following components: a processor 510, a memory 520, and one or more application programs, wherein the one or more application programs may be stored in the memory 520 and configured to be executed by the one or more processors 510, the one or more application programs being configured to perform the methods described in the foregoing method embodiments.
The processor 510 may include one or more processing cores. The processor 510 connects various parts of the electronic device 500 using various interfaces and lines, and performs the various functions of the electronic device 500 and processes data by running or executing instructions, programs, code sets or instruction sets stored in the memory 520 and by invoking data stored in the memory 520. Optionally, the processor 510 may be implemented in hardware in at least one of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA) and Programmable Logic Array (PLA) form. The processor 510 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, user interface, application programs and the like; the GPU is responsible for rendering and drawing display content; and the modem is used to handle wireless communication. It is understood that the modem may also not be integrated into the processor 510 and may instead be implemented by a separate communication chip.
The memory 520 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory 520 may be used to store instructions, programs, code, code sets or instruction sets. The memory 520 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a data acquisition function, a marking function, a screenshot function, etc.), instructions for implementing the various method embodiments described above, and the like. The data storage area may store data created by the electronic device 500 in use (such as the first voice segment, the second voice segment, the third map, the fourth map, the screenshot, the map parameters, etc.).
Referring to fig. 10, fig. 10 is a block diagram illustrating the structure of a computer-readable storage medium according to an embodiment of the present disclosure. The computer-readable storage medium 600 stores program code which can be called by a processor to execute the voice data comparison processing method described in the above method embodiments.
The computer-readable storage medium 600 may be an electronic memory such as a flash memory, an EEPROM (Electrically Erasable Programmable Read-Only Memory), an EPROM, a hard disk, or a ROM. Optionally, the computer-readable storage medium 600 includes a non-volatile computer-readable storage medium. The computer-readable storage medium 600 has storage space for program code 610 for performing any of the method steps of the methods described above. The program code 610 can be read from or written into one or more computer program products. The program code 610 may, for example, be compressed in a suitable form.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions, so that the computer device executes the voice data comparison processing method described in the above-mentioned various optional embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not necessarily depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (11)

1. A method for comparing and processing voice data is characterized by comprising the following steps:
acquiring first voice data and second voice data to be compared;
displaying a first map representing the first voice data and a second map representing the second voice data in a first interface;
determining a first target time period according to the first time period corresponding to the first marking instruction; determining a second target time period according to a second time period corresponding to the second marking instruction;
wherein the first marking instruction is generated in response to a marking operation of a user on the first voice data, and the second marking instruction is generated in response to a marking operation of the user on the second voice data; a time range of the first target time period includes and is greater than a time range of the first time period, and a time range of the second target time period includes and is greater than a time range of the second time period;
selecting a voice segment corresponding to the first target time period in the first voice data as a first voice segment; selecting a voice segment corresponding to the second target time period in the second voice data as a second voice segment; and
displaying a third map representing the first voice segment and a fourth map representing the second voice segment in a second interface.
2. The method of claim 1, wherein determining a first target time period according to a first time period corresponding to the first marking instruction and determining a second target time period according to a second time period corresponding to the second marking instruction comprises:
determining a first preset time period according to the time range of the first time period; the first preset time period comprises a first pre-preset time period before the first time period and/or a first post-preset time period after the first time period;
taking the first time period and the first preset time period as a first target time period;
determining a second preset time period according to the time range of the second time period; the second preset time period comprises a second pre-preset time period before the second time period and/or a second post-preset time period after the second time period;
and taking the second time period and the second preset time period as a second target time period.
3. The method of claim 1, wherein determining a first target time period according to a first time period corresponding to the first marking instruction and determining a second target time period according to a second time period corresponding to the second marking instruction comprises:
recognizing the voice segment corresponding to the first time period, and determining a first preset time period according to a recognition result; the first preset time period comprises a first pre-preset time period before the first time period and/or a first post-preset time period after the first time period;
taking the first time period and the first preset time period as a first target time period;
recognizing the voice segment corresponding to the second time period, and determining a second preset time period according to a recognition result; the second preset time period comprises a second pre-preset time period before the second time period and/or a second post-preset time period after the second time period;
and taking the second time period and the second preset time period as a second target time period.
4. The method for comparing and processing voice data according to claim 1, wherein the displaying a third map representing the first voice segment comprises:
displaying the third map centered on the first time period; and
the displaying a fourth map representing the second voice segment comprises:
displaying the fourth map centered on the second time period.
5. The method for comparing and processing voice data according to claim 1, further comprising:
re-determining the first target time period in response to a first adjustment instruction received by the second interface, and refreshing the third map according to the re-determined first target time period;
and/or, re-determining the second target time period in response to a second adjustment instruction received by the second interface, and refreshing the fourth map according to the re-determined second target time period.
6. The method for comparing and processing voice data according to claim 1, further comprising:
in response to a received map updating instruction, processing the voice data of the first voice segment and the second voice segment according to map parameters corresponding to the map updating instruction, and correspondingly updating the third map and the fourth map.
7. The method for comparing and processing voice data according to claim 6, wherein the method further comprises:
displaying at least one interface element for receiving map parameters input by a user in a third interface;
obtaining the map parameters in response to the input content received by the interface element; and
triggering the map updating instruction according to the map parameters.
8. The method for comparing and processing voice data according to any one of claims 1 to 7, further comprising:
generating a screenshot at least comprising the third map and the fourth map in response to the received screenshot instruction; and
copying the screenshot to a clipboard or saving the screenshot to a file.
9. A voice data comparison processing device is characterized by comprising:
the data acquisition module is used for acquiring first voice data and second voice data to be compared;
the first display module is used for displaying a first map representing the first voice data and displaying a second map representing the second voice data in a first interface;
the segment marking module is used for determining a first target time period according to a first time period corresponding to the first marking instruction; determining a second target time period according to a second time period corresponding to the second marking instruction; wherein the first marking instruction is generated in response to a marking operation of a user on the first voice data, and the second marking instruction is generated in response to a marking operation of the user on the second voice data; a time range of the first target time period includes and is greater than a time range of the first time period, and a time range of the second target time period includes and is greater than a time range of the second time period; selecting a voice segment corresponding to the first target time period in the first voice data as a first voice segment; and selecting a voice segment corresponding to the second target time period in the second voice data as a second voice segment; and
the second display module is used for displaying a third map representing the first voice segment and a fourth map representing the second voice segment in a second interface.
10. An electronic device, comprising:
one or more processors;
a memory;
one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more application programs being configured to perform the voice data comparison processing method of any one of claims 1 to 8.
11. A computer-readable storage medium, wherein program code is stored in the computer-readable storage medium, and the program code can be invoked by a processor to execute the voice data comparison processing method according to any one of claims 1 to 8.
CN202210012225.3A 2022-01-07 2022-01-07 Voice data comparison processing method and device, electronic equipment and storage medium Active CN114038468B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210012225.3A CN114038468B (en) 2022-01-07 2022-01-07 Voice data comparison processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210012225.3A CN114038468B (en) 2022-01-07 2022-01-07 Voice data comparison processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114038468A CN114038468A (en) 2022-02-11
CN114038468B true CN114038468B (en) 2022-04-15

Family

ID=80141348

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210012225.3A Active CN114038468B (en) 2022-01-07 2022-01-07 Voice data comparison processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114038468B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1327575A (en) * 1999-10-21 2001-12-19 卡西欧计算机株式会社 Speaker recognition using spectrogram correlation
CN110298463A (en) * 2019-05-21 2019-10-01 深圳壹账通智能科技有限公司 Meeting room preordering method, device, equipment and storage medium based on speech recognition
CN111639157A (en) * 2020-05-13 2020-09-08 广州国音智能科技有限公司 Audio marking method, device, equipment and readable storage medium
CN111640421A (en) * 2020-05-13 2020-09-08 广州国音智能科技有限公司 Voice comparison method, device, equipment and computer readable storage medium
CN112017635A (en) * 2020-08-27 2020-12-01 北京百度网讯科技有限公司 Method and device for detecting voice recognition result
CN112331193A (en) * 2019-07-17 2021-02-05 华为技术有限公司 Voice interaction method and related device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112017650B (en) * 2019-05-31 2024-05-24 百度在线网络技术(北京)有限公司 Voice control method and device of electronic equipment, computer equipment and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1327575A (en) * 1999-10-21 2001-12-19 卡西欧计算机株式会社 Speaker recognition using spectrogram correlation
CN110298463A (en) * 2019-05-21 2019-10-01 深圳壹账通智能科技有限公司 Meeting room preordering method, device, equipment and storage medium based on speech recognition
CN112331193A (en) * 2019-07-17 2021-02-05 华为技术有限公司 Voice interaction method and related device
CN111639157A (en) * 2020-05-13 2020-09-08 广州国音智能科技有限公司 Audio marking method, device, equipment and readable storage medium
CN111640421A (en) * 2020-05-13 2020-09-08 广州国音智能科技有限公司 Voice comparison method, device, equipment and computer readable storage medium
CN112017635A (en) * 2020-08-27 2020-12-01 北京百度网讯科技有限公司 Method and device for detecting voice recognition result

Also Published As

Publication number Publication date
CN114038468A (en) 2022-02-11

Similar Documents

Publication Publication Date Title
US11610585B2 (en) Embedded instructions for voice user interface
WO2020073944A1 (en) Speech synthesis method and device
CN111145754B (en) Voice input method, device, terminal equipment and storage medium
US10409552B1 (en) Speech-based audio indicators
CN109545197B (en) Voice instruction identification method and device and intelligent terminal
US20230176813A1 (en) Graphical interface for speech-enabled processing
WO2019096056A1 (en) Speech recognition method, device and system
WO2020107834A1 (en) Verification content generation method for lip-language recognition, and related apparatus
CN110111778B (en) Voice processing method and device, storage medium and electronic equipment
CN112015872A (en) Question recognition method and device
CN113160819A (en) Method, apparatus, device, medium and product for outputting animation
US11996084B2 (en) Speech synthesis method and apparatus, device and computer storage medium
CN113611316A (en) Man-machine interaction method, device, equipment and storage medium
CN114038468B (en) Voice data comparison processing method and device, electronic equipment and storage medium
CN113656546A (en) Multimodal search method, apparatus, device, storage medium, and program product
CN111128130B (en) Voice data processing method and device and electronic device
EP3640937B1 (en) Electronic apparatus and controlling method thereof
JP7510562B2 (en) AUDIO DATA PROCESSING METHOD, DEVICE, ELECTRONIC APPARATUS, MEDIUM, AND PROGRAM PRODUCT
JP2020134719A (en) Translation device, translation method, and translation program
CN113851106B (en) Audio playing method and device, electronic equipment and readable storage medium
TWI582756B (en) A method of switching input mode, a mobile communication device, and a computer readable medium
US20230056128A1 (en) Speech processing method and apparatus, device and computer storage medium
CN114400010A (en) Method, device and equipment for displaying and processing spectrogram and storage medium
CN115240657A (en) Voice processing method, device, equipment and storage medium
CN115273825A (en) Voice instruction recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230718

Address after: 518000 Room 201, building A, 1 front Bay Road, Shenzhen Qianhai cooperation zone, Shenzhen, Guangdong

Patentee after: VOICEAI TECHNOLOGIES Co.,Ltd.

Patentee after: Shenzhen Digital Miracle Technology Co.,Ltd.

Address before: 518000 Room 201, building A, No. 1, Qian Wan Road, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong (Shenzhen Qianhai business secretary Co., Ltd.)

Patentee before: VOICEAI TECHNOLOGIES Co.,Ltd.

TR01 Transfer of patent right