CN117975949B - Event recording method, device, equipment and medium based on voice conversion

Event recording method, device, equipment and medium based on voice conversion

Info

Publication number
CN117975949B
Authority
CN
China
Prior art keywords
data
event
acquiring
user
event recording
Prior art date
Legal status
Active
Application number
CN202410366053.9A
Other languages
Chinese (zh)
Other versions
CN117975949A (en)
Inventor
朱磊
卢骁
陈裕妙
陈楠
蒋志立
Current Assignee
Hangzhou Weican Technology Co ltd
Original Assignee
Hangzhou Weican Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Weican Technology Co ltd filed Critical Hangzhou Weican Technology Co ltd
Priority to CN202410366053.9A
Publication of CN117975949A
Application granted
Publication of CN117975949B

Landscapes

  • User Interface Of Digital Computer (AREA)

Abstract

The invention relates to the technical field of artificial intelligence and provides an event recording method, device, equipment and medium based on voice conversion. The method starts an event recording interface corresponding to the event type of a target event and, when data is detected in a designated input box of the event recording interface, verifies the input data so that subsequent processing proceeds only on the premise that the basic information is correct. When the input data passes the verification, user voice is collected in real time and optimized based on the frequency band signal strength to obtain data to be processed; the data to be processed is split according to a configuration splitting strategy to obtain data to be converted; the data to be converted is input into a pre-trained multi-language conversion model to obtain a target text; and the target text is inserted into a designated area of the event recording interface. Voice is thereby converted into text in real time by artificial-intelligence means, which improves both the accuracy and the processing efficiency of event recording.

Description

Event recording method, device, equipment and medium based on voice conversion
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a medium for recording events based on voice conversion.
Background
At present, when handling events such as drunk driving, the staff concerned generally need to record the scene with a voice recorder, video camera or similar device, and then analyze, summarize and organize the material afterwards so that the on-site situation or the inquiry content can be stored in text form.
Because the persons being questioned speak with different accents, turning the on-site recordings into text causes considerable difficulty for the staff involved, and both the accuracy and the processing efficiency of the resulting text are low.
Disclosure of Invention
In view of the foregoing, it is necessary to provide an event recording method, device, equipment and medium based on voice conversion, which aim to solve the problems of low event recording efficiency and low accuracy.
A voice conversion based event recording method, the voice conversion based event recording method comprising:
Responding to a recording instruction of a target event, acquiring an event type of the target event, and starting an event recording interface corresponding to the event type;
when detecting that data is input in a designated input box of the event recording interface, checking the input data;
When the input data passes the verification, the voice of the user is collected in real time;
optimizing the user voice based on the frequency band signal intensity to obtain data to be processed;
splitting the data to be processed according to a configuration splitting strategy to obtain data to be converted;
Inputting the data to be converted into a pre-trained multilingual conversion model to obtain a target text;
and inserting the target text into a designated area of the event recording interface.
According to a preferred embodiment of the present invention, the optimizing the user voice based on the frequency band signal strength includes:
converting the user voice into a digital signal to obtain a first signal;
identifying a high-band signal from the first signal;
boosting the frequency spectrum of the high-frequency band signal to obtain a second signal;
acquiring a preset threshold value, and denoising the second signal based on the preset threshold value to obtain the data to be processed.
According to a preferred embodiment of the present invention, splitting the data to be processed according to the configuration splitting policy, to obtain the data to be converted includes:
Acquiring the user timbres corresponding to the data to be processed, and performing primary splitting of the data to be processed according to the user timbres to obtain a data segment corresponding to each user timbre;
Acquiring a pause time threshold, and performing secondary splitting of the data segment corresponding to each user timbre according to the pause time threshold to obtain the first sub-data segments;
Acquiring a pre-established dictionary, and fusing each first sub-data segment by using the dictionary to obtain a plurality of second sub-data segments;
Acquiring the starting time and the ending time of each second sub-data segment;
and marking each second sub-data segment according to the starting time and the ending time of each second sub-data segment to obtain the data to be converted.
According to a preferred embodiment of the present invention, the inputting the data to be converted into a pre-trained multilingual conversion model, and obtaining the target text includes:
Sequentially inputting each second sub-data segment in the data to be converted into the multi-language conversion model according to the marked starting time to obtain a text corresponding to each second sub-data segment;
sequentially combining texts corresponding to each second sub-data segment according to the starting time and the ending time of the mark to obtain the target text;
The multi-language conversion model is obtained by training a bidirectional long short-term memory neural network based on a plurality of language samples.
According to a preferred embodiment of the invention, the method further comprises:
Responding to the synchronous recording instruction, and detecting the user type;
when the user type is a preset type, sending out prompt information asking whether to turn on the accompanying camera;
when a confirmation signal fed back based on the prompt information is received, determining the accompanying room in which the user's accompanying person is located;
turning on the accompanying camera of the accompanying room;
acquiring, in real time, the video captured by the camera of the area where the user is located as an initial video;
inserting the pictures captured by the accompanying camera in real time into the initial video in a picture-in-picture mode to obtain a recorded video;
uploading the recorded video as an attachment to the designated position of the event recording interface.
According to a preferred embodiment of the invention, the method further comprises:
determining an evidence presentation type in response to a video evidence presentation instruction;
when the evidence presentation type is ordinary presentation, acquiring the selected camera as a target camera, and displaying, in real time, the picture captured by the target camera; or
when the evidence presentation type is article presentation, acquiring the selected article as a target article, displaying an image of the target article, and projecting a real-time video of the target article onto a designated display.
According to a preferred embodiment of the present invention, after the target text is inserted into the designated area of the event recording interface, the method further includes:
Generating an event record file according to the event record interface;
when the event record file is of a remote inquiry type, acquiring a first real-time video of the local inquiry room and a second real-time video of the remote inquiry room, and simultaneously displaying the first real-time video, the second real-time video and the target text on the display of the local inquiry room and the display of the remote inquiry room; or
when the event record file is of a specified type, marking the event record file and displaying a tag-adding prompt in the form of a pop-up box; when an analysis instruction for any tag is received, displaying the detailed information of that tag; when an update instruction for any tag is received, performing a delete operation or a modify operation on that tag according to the update instruction; and when any tag is of the instant messaging type and a query instruction for that tag is received, displaying the instant messaging message record.
A speech-conversion-based event recording device, the speech-conversion-based event recording device comprising:
The starting unit is used for responding to a recording instruction of a target event, acquiring the event type of the target event and starting an event recording interface corresponding to the event type;
The verification unit is used for verifying the input data when it is detected that data is input in the designated input box of the event recording interface;
the acquisition unit is used for acquiring user voice in real time when the input data passes the verification;
The optimizing unit is used for optimizing the user voice based on the frequency band signal intensity to obtain data to be processed;
The splitting unit is used for splitting the data to be processed according to a configuration splitting strategy to obtain data to be converted;
the input unit is used for inputting the data to be converted into a multi-language conversion model trained in advance to obtain a target text;
And the inserting unit is used for inserting the target text into the designated area of the event recording interface.
A computer device, the computer device comprising:
a memory storing at least one instruction; and
And the processor executes the instructions stored in the memory to realize the event recording method based on voice conversion.
A computer-readable storage medium having stored therein at least one instruction for execution by a processor in a computer device to implement the speech conversion based event recording method.
According to the above technical solution, the invention starts an event recording interface corresponding to the event type of the target event and, when data is detected in a designated input box of the event recording interface, verifies the input data so that subsequent processing proceeds only on the premise that the basic information is correct. When the input data passes the verification, user voice is collected in real time and optimized based on the frequency band signal strength to obtain data to be processed; the data to be processed is split according to a configuration splitting strategy to obtain data to be converted; the data to be converted is input into a pre-trained multi-language conversion model to obtain a target text; and the target text is inserted into a designated area of the event recording interface. Voice is thereby converted into text in real time by artificial-intelligence means, which improves both the accuracy and the processing efficiency of event recording.
Drawings
FIG. 1 is a flow chart of a preferred embodiment of a speech conversion based event recording method of the present invention.
FIG. 2 is a functional block diagram of a preferred embodiment of a speech conversion based event recording device of the present invention.
Fig. 3 is a schematic structural diagram of a computer device for implementing a voice conversion-based event recording method according to a preferred embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
FIG. 1 is a flow chart of a preferred embodiment of a speech conversion based event recording method of the present invention. The order of the steps in the flowchart may be changed and some steps may be omitted according to various needs.
The event recording method based on voice conversion is applied to one or more computer devices. A computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device may be any electronic product that can interact with a user in a human-computer manner, such as a personal computer, a tablet computer, a smart phone, a Personal Digital Assistant (PDA), a game console, an interactive Internet Protocol Television (IPTV), a smart wearable device, and the like.
The computer device may also include a network device and/or a user device. The network device includes, but is not limited to, a single network server, a server group composed of a plurality of network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing.
The server may be an independent server, or may be a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, Content Delivery Networks (CDN), and big data and artificial intelligence platforms.
Artificial Intelligence (AI) is the theory, method, technology and application system that uses a digital computer or a digital-computer-controlled machine to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems and mechatronics. Artificial intelligence software technologies mainly include computer vision technology, robotics, biometric recognition technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
The network in which the computer device is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a Virtual Private Network (VPN), and the like.
S10, responding to a recording instruction of a target event, acquiring the event type of the target event, and starting an event recording interface corresponding to the event type.
In this embodiment, the target event may include, but is not limited to: drunk driving event, driving without license event, etc.
In this embodiment, the recording instruction may be triggered when the login information is detected, or may be triggered when the selection of the specified key is detected.
In this embodiment, the event types may include, but are not limited to: query type, recognition type, search type, etc.
In this embodiment, the event recording interface may be an operation interface for specifying an application program, or may be an operation web page.
In this embodiment, since the files and recording forms required to be recorded for different event types are different, different event recording interfaces may be preconfigured for each event type, so as to perform efficient and comprehensive recording on different event types.
S11, when it is detected that data is input in the designated input box of the event recording interface, checking the input data.
In this embodiment, the designated input box may be an input box corresponding to a required field of the configured basic information; for example, the designated input boxes may include input boxes for required fields such as the event number, event name, event type, inquiry room, location, and handling person.
In this embodiment, verifying the input data includes:
performing an integrity check on the input data; and/or
performing a conflict check on the input data; and/or
performing a format check on the input data.
In the above embodiment, the integrity check ensures that every required field is filled in, the conflict check avoids the resource waste and information errors caused by recording the same event twice, and the format check ensures that the data entered in each required field has the correct form, thereby avoiding data recording errors.
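For concreteness only, the sketch below shows one way the three checks could be implemented; the field names, the regular-expression patterns, and the set of already-recorded event numbers are assumptions introduced for illustration and are not part of the disclosure.

```python
# Illustrative sketch of the integrity, conflict, and format checks.
# REQUIRED_FIELDS, FORMAT_RULES, and existing_event_numbers are assumed names.
import re

REQUIRED_FIELDS = ["event_number", "event_name", "event_type",
                   "inquiry_room", "location", "handler"]
FORMAT_RULES = {
    "event_number": re.compile(r"^[A-Z0-9-]{6,20}$"),   # assumed pattern
    "event_type": re.compile(r"^(inquiry|identification|search)$"),
}

def verify_input(data: dict, existing_event_numbers: set) -> list:
    """Return a list of error messages; an empty list means the data passes."""
    errors = []
    # Integrity check: every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if not str(data.get(field, "")).strip():
            errors.append(f"missing required field: {field}")
    # Conflict check: the same event must not be recorded twice.
    if data.get("event_number") in existing_event_numbers:
        errors.append("duplicate event number: this event is already recorded")
    # Format check: each configured field must match its expected pattern.
    for field, pattern in FORMAT_RULES.items():
        value = str(data.get(field, ""))
        if value and not pattern.match(value):
            errors.append(f"invalid format for {field}: {value!r}")
    return errors
```

In such a sketch, an empty return value corresponds to the verification passing, while any message in the list would be surfaced as the error prompt mentioned below.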
S12, when the input data passes the verification, the user voice is collected in real time.
In this embodiment, when the input data fails the verification, an error prompt is issued to help the relevant personnel deal with it in time.
In this embodiment, a voice acquisition device such as a microphone may be used to collect the user voice in real time.
S13, optimizing the user voice based on the frequency band signal intensity to obtain data to be processed.
In this embodiment, optimizing the user voice based on the frequency band signal strength includes:
converting the user voice into a digital signal to obtain a first signal;
identifying a high-band signal from the first signal;
boosting the frequency spectrum of the high-frequency band signal to obtain a second signal;
acquiring a preset threshold value, and denoising the second signal based on the preset threshold value to obtain the data to be processed.
The user voice can be converted into a digital signal which can be recognized and processed by a machine.
Boosting the spectrum of the high-frequency band signal makes the energy distribution across frequency bands more balanced, which improves the accuracy of the subsequent voice conversion.
The preset threshold value can be configured according to actual scene requirements.
The interference of invalid data can be reduced by denoising, and the accuracy of subsequent voice conversion is further improved.
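As a rough illustration of this optimization step, not the claimed implementation, the numpy sketch below lifts the high-frequency band of the spectrum and zeroes components below a preset threshold; the 2 kHz cutoff, the 1.5x boost factor, and the relative threshold are assumed values chosen only for the example.

```python
# Minimal sketch: boost the high-frequency band, then threshold-denoise.
# The cutoff, boost factor, and threshold below are assumptions, not values
# taken from the disclosure.
import numpy as np

def optimize_speech(samples: np.ndarray, sample_rate: int,
                    high_cutoff_hz: float = 2000.0,
                    boost: float = 1.5,
                    noise_threshold: float = 0.02) -> np.ndarray:
    first_signal = samples.astype(np.float64)            # digitized "first signal"

    spectrum = np.fft.rfft(first_signal)
    freqs = np.fft.rfftfreq(len(first_signal), d=1.0 / sample_rate)

    # Identify the high-frequency band and lift its spectrum ("second signal").
    spectrum[freqs >= high_cutoff_hz] *= boost

    # Denoise against a preset threshold, here taken relative to the peak magnitude.
    magnitude = np.abs(spectrum)
    spectrum[magnitude < noise_threshold * magnitude.max()] = 0.0

    return np.fft.irfft(spectrum, n=len(first_signal))   # data to be processed
```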
S14, splitting the data to be processed according to a configuration splitting strategy to obtain the data to be converted.
In this embodiment, splitting the data to be processed according to the configuration splitting policy includes:
Acquiring the user timbres corresponding to the data to be processed, and performing primary splitting of the data to be processed according to the user timbres to obtain a data segment corresponding to each user timbre;
Acquiring a pause time threshold, and performing secondary splitting of the data segment corresponding to each user timbre according to the pause time threshold to obtain the first sub-data segments;
Acquiring a pre-established dictionary, and fusing each first sub-data segment by using the dictionary to obtain a plurality of second sub-data segments;
Acquiring the starting time and the ending time of each second sub-data segment;
and marking each second sub-data segment according to the starting time and the ending time of each second sub-data segment to obtain the data to be converted.
The pause time threshold can be configured according to user habits, for example 3 seconds: if the user stops speaking for more than 3 seconds, the previous sentence is considered finished by default.
In the above embodiment, the data to be processed is first split by timbre to distinguish the audio of different users; it is then split a second time using the pause time threshold so that the speech is broken into sentences reasonably; finally, the first sub-data segments produced by the pause-based split are fused using a pre-established dictionary, which avoids sentence-breaking errors caused by an unusually long pause or by the user's speaking habits (for example, if the user says "I like eating", pauses for 4 seconds, and then says "tomatoes", the 3-second pause threshold splits "I like eating" and "tomatoes" into two segments, but the dictionary-based fusion merges them back into the single sentence "I like eating tomatoes"), making the sentence breaks more reasonable. Further, each second sub-data segment is marked with its start time and end time, so that every data segment has a definite temporal order.
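The sketch below illustrates the pause-based secondary split and the dictionary fusion on a simplified representation in which every audio piece already carries a speaker (timbre) label, a draft transcript, and start/end times; the data structure, the 3-second default, and the phrase dictionary are assumptions made for illustration, and a real system would fuse the audio segments themselves rather than draft text.

```python
# Simplified model of the secondary split and the dictionary fusion.
# Piece, the 3-second default, and the phrase set are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Piece:
    speaker: str    # user timbre found by the primary split
    text: str       # draft transcript, used here only to test fusion
    start: float    # seconds from the start of the recording
    end: float

def pause_split(pieces, pause_threshold=3.0):
    """Secondary split: merge consecutive pieces of one speaker unless the
    silence between them exceeds the pause time threshold."""
    out = []
    for p in pieces:
        prev = out[-1] if out else None
        if prev and prev.speaker == p.speaker and p.start - prev.end <= pause_threshold:
            out[-1] = Piece(prev.speaker, f"{prev.text} {p.text}".strip(),
                            prev.start, p.end)
        else:
            out.append(p)
    return out          # first sub-data segments

def dictionary_fuse(first_subs, phrases):
    """Fuse adjacent first sub-data segments whose joined text is a known
    phrase, e.g. 'I like eating' + 'tomatoes'."""
    out = []
    for p in first_subs:
        prev = out[-1] if out else None
        joined = f"{prev.text} {p.text}".strip() if prev else ""
        if prev and prev.speaker == p.speaker and joined in phrases:
            out[-1] = Piece(prev.speaker, joined, prev.start, p.end)
        else:
            out.append(p)
    return out          # second sub-data segments, already carrying their time marks
```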
S15, inputting the data to be converted into a multi-language conversion model trained in advance to obtain a target text.
In this embodiment, the inputting the data to be converted into a pre-trained multilingual conversion model, and obtaining the target text includes:
Sequentially inputting each second sub-data segment in the data to be converted into the multi-language conversion model according to the marked starting time to obtain a text corresponding to each second sub-data segment;
sequentially combining texts corresponding to each second sub-data segment according to the starting time and the ending time of the mark to obtain the target text;
The multi-language conversion model is obtained by training a bidirectional long short-term memory neural network based on a plurality of language samples.
The multi-language conversion model can support conversion of Mandarin, English and various dialects; the language types it supports depend mainly on the selection of the training samples.
The bidirectional long short-term memory neural network effectively preserves the temporal order of the speech.
In the above embodiment, the speech is converted and the resulting texts are combined in the order in which the user spoke, so the output text is displayed in speaking order and the user's voice is converted into text in real time, which improves the efficiency of event recording while the artificial-intelligence means ensures the accuracy of the record.
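By way of example only, the sketch below shows the general shape of such a bidirectional LSTM recognizer in PyTorch, together with the time-ordered recombination of the per-segment texts; the layer sizes, the 80-dimensional acoustic features, and the token-classification head are assumptions, since the patent only specifies a bidirectional long short-term memory network trained on samples of several languages.

```python
# Assumed architecture sketch; only the bidirectional LSTM itself comes from
# the description, the dimensions and the classifier head are illustrative.
import torch
import torch.nn as nn

class BiLSTMConverter(nn.Module):
    def __init__(self, n_features: int = 80, hidden: int = 256, n_tokens: int = 5000):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_tokens)   # per-frame token scores

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, frames, n_features) acoustic features of one sub-segment
        encoded, _ = self.encoder(features)
        return self.head(encoded)                     # (batch, frames, n_tokens)

def combine_by_time(segment_texts):
    """segment_texts: iterable of (start_time, end_time, text) per second
    sub-data segment; join them in the order the words were actually spoken."""
    return " ".join(text for _, _, text in sorted(segment_texts))
```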
S16, inserting the target text into the designated area of the event recording interface.
In the above embodiment, after the target text is inserted into the designated area of the event recording interface, the event can be quickly recorded.
In this embodiment, after the target text is inserted into the specified area of the event recording interface, the method further includes:
Generating an event record file according to the event record interface;
when the event record file is of a remote inquiry type, acquiring a first real-time video of the local inquiry room and a second real-time video of the remote inquiry room, and simultaneously displaying the first real-time video, the second real-time video and the target text on the display of the local inquiry room and the display of the remote inquiry room; or
when the event record file is of a specified type, marking the event record file and displaying a tag-adding prompt in the form of a pop-up box; when an analysis instruction for any tag is received, displaying the detailed information of that tag; when an update instruction for any tag is received, performing a delete operation or a modify operation on that tag according to the update instruction; and when any tag is of the instant messaging type and a query instruction for that tag is received, displaying the instant messaging message record.
The specified type can be configured according to actual requirements.
In the above embodiment, when the event record file is of the remote inquiry type, the real-time video of both parties and the text obtained from the voice conversion can be displayed simultaneously on the displays of the local inquiry room and the remote inquiry room, which assists a better remote inquiry; when the event record file is of a specified type, that record file can be marked and tags can be added, helping the relevant personnel quickly retrieve the related event information, modify the event information, and view the related instant messaging records.
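Purely as an illustration of the tag operations just described (adding a tag, displaying its details, deleting or modifying it, and querying an instant-messaging tag), a minimal in-memory sketch might look as follows; the Tag fields and the store are assumed for the example.

```python
# Illustrative tag store for a specified-type event record file.
# The Tag fields and method names are assumptions, not the claimed interface.
from dataclasses import dataclass, field

@dataclass
class Tag:
    name: str
    detail: str
    tag_type: str = "general"                  # e.g. "general" or "instant_messaging"
    im_messages: list = field(default_factory=list)

class EventRecordTags:
    def __init__(self):
        self._tags = {}

    def add(self, tag: Tag) -> None:
        self._tags[tag.name] = tag

    def analyze(self, name: str) -> str:
        return self._tags[name].detail          # "display detailed information"

    def update(self, name: str, new_detail=None) -> None:
        if new_detail is None:
            del self._tags[name]                # delete operation
        else:
            self._tags[name].detail = new_detail    # modify operation

    def query_im_record(self, name: str) -> list:
        tag = self._tags[name]
        # Only instant-messaging tags expose a message record.
        return tag.im_messages if tag.tag_type == "instant_messaging" else []
```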
In this embodiment, the method further includes:
Responding to the synchronous recording instruction, and detecting the user type;
when the user type is a preset type, sending out prompt information asking whether to turn on the accompanying camera;
when a confirmation signal fed back based on the prompt information is received, determining the accompanying room in which the user's accompanying person is located;
turning on the accompanying camera of the accompanying room;
acquiring, in real time, the video captured by the camera of the area where the user is located as an initial video;
inserting the pictures captured by the accompanying camera in real time into the initial video in a picture-in-picture mode to obtain a recorded video;
uploading the recorded video as an attachment to the designated position of the event recording interface.
The preset types may include categories of persons who need to be accompanied, such as the elderly and children.
When the synchronous recording instruction is received, a user can be prompted to select the number of recordings.
Through this embodiment, synchronous video recording is realized; at the same time, for special categories of persons who need to be accompanied, the video of the accompanying person is recorded simultaneously, which supports a more comprehensive record and is convenient when the event is later analyzed from the videos.
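One possible way to realize the picture-in-picture insertion is sketched below using ffmpeg's scale and overlay filters invoked from Python; the file names, the 320-pixel inset width, and the top-right placement are assumptions, and the patent does not prescribe any particular tool.

```python
# Sketch only: compose the accompanying-camera feed as a picture-in-picture
# inset over the initial video. Paths and inset geometry are assumed values.
import subprocess

def insert_picture_in_picture(initial_video: str,
                              accompany_video: str,
                              output_path: str) -> None:
    subprocess.run([
        "ffmpeg", "-y",
        "-i", initial_video,          # full-frame video of the user's area
        "-i", accompany_video,        # accompanying-camera feed
        "-filter_complex",
        "[1:v]scale=320:-1[pip];[0:v][pip]overlay=W-w-16:16",
        "-c:a", "copy",
        output_path,
    ], check=True)
```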
In this embodiment, the method further includes:
determining an evidence presentation type in response to a video evidence presentation instruction;
when the evidence presentation type is ordinary presentation, acquiring the selected camera as a target camera, and displaying, in real time, the picture captured by the target camera; or
when the evidence presentation type is article presentation, acquiring the selected article as a target article, displaying an image of the target article, and projecting a real-time video of the target article onto a designated display.
Through this embodiment, targeted evidence presentation based on the selected camera or the selected article is supported.
According to the above technical solution, the invention starts an event recording interface corresponding to the event type of the target event and, when data is detected in a designated input box of the event recording interface, verifies the input data so that subsequent processing proceeds only on the premise that the basic information is correct. When the input data passes the verification, user voice is collected in real time and optimized based on the frequency band signal strength to obtain data to be processed; the data to be processed is split according to a configuration splitting strategy to obtain data to be converted; the data to be converted is input into a pre-trained multi-language conversion model to obtain a target text; and the target text is inserted into a designated area of the event recording interface. Voice is thereby converted into text in real time by artificial-intelligence means, which improves both the accuracy and the processing efficiency of event recording.
FIG. 2 is a functional block diagram of a preferred embodiment of a speech conversion based event recording device according to the present invention. The event recording device 11 based on voice conversion comprises a starting unit 110, a verification unit 111, an acquisition unit 112, an optimizing unit 113, a splitting unit 114, an input unit 115 and an inserting unit 116. The modules/units referred to in the present invention are a series of computer program segments that are stored in a memory, can be executed by a processor, and perform fixed functions. In this embodiment, the functions of the respective modules/units will be described in detail in the following embodiments.
The starting unit 110 is configured to obtain an event type of a target event in response to a recording instruction of the target event, and start an event recording interface corresponding to the event type;
The verification unit 111 is configured to, when detecting that data is input in a specified input box of the event recording interface, verify the input data;
The collecting unit 112 is configured to collect user voice in real time when the input data passes the verification;
The optimizing unit 113 is configured to optimize the user voice based on the frequency band signal strength, so as to obtain data to be processed;
the splitting unit 114 is configured to split the data to be processed according to a configuration splitting policy to obtain data to be converted;
The input unit 115 is configured to input the data to be converted into a pre-trained multilingual conversion model, so as to obtain a target text;
The inserting unit 116 is configured to insert the target text into a specified area of the event recording interface.
According to the above technical solution, the device starts an event recording interface corresponding to the event type of the target event and, when data is detected in a designated input box of the event recording interface, verifies the input data so that subsequent processing proceeds only on the premise that the basic information is correct. When the input data passes the verification, user voice is collected in real time and optimized based on the frequency band signal strength to obtain data to be processed; the data to be processed is split according to a configuration splitting strategy to obtain data to be converted; the data to be converted is input into a pre-trained multi-language conversion model to obtain a target text; and the target text is inserted into a designated area of the event recording interface. Voice is thereby converted into text in real time by artificial-intelligence means, which improves both the accuracy and the processing efficiency of event recording.
Fig. 3 is a schematic structural diagram of a computer device according to a preferred embodiment of the present invention for implementing a voice conversion based event recording method.
The computer device 1 may comprise a memory 12, a processor 13 and a bus, and may further comprise a computer program stored in the memory 12 and executable on the processor 13, such as an event recording program based on speech conversion.
It will be appreciated by those skilled in the art that the schematic diagram is merely an example of the computer device 1 and does not constitute a limitation of it; the computer device 1 may have a bus structure or a star structure, may comprise more or fewer hardware or software components than illustrated, or may have a different arrangement of components; for example, the computer device 1 may further comprise an input/output device, a network access device, and the like.
It should be noted that the computer device 1 is only an example; other existing or future electronic products that can be adapted to the present invention should also be included in the scope of protection of the present invention and are incorporated herein by reference.
The memory 12 includes at least one type of readable storage medium, including flash memory, a removable hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 12 may be an internal storage unit of the computer device 1, such as a removable hard disk of the computer device 1. In other embodiments, the memory 12 may also be an external storage device of the computer device 1, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card provided on the computer device 1. Further, the memory 12 may include both an internal storage unit and an external storage device of the computer device 1. The memory 12 may be used not only to store the application software installed in the computer device 1 and various types of data, such as the code of the event recording program based on voice conversion, but also to temporarily store data that has been or will be output.
In some embodiments, the processor 13 may be composed of an integrated circuit, for example a single packaged integrated circuit, or of multiple integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPU), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like. The processor 13 is the control unit of the computer device 1; it connects the various components of the entire computer device 1 using various interfaces and lines, and executes the various functions of the computer device 1 and processes data by running or executing the programs or modules stored in the memory 12 (for example, executing the event recording program based on voice conversion) and calling the data stored in the memory 12.
The processor 13 executes the operating system of the computer device 1 and various types of applications installed. The processor 13 executes the application program to implement the steps of the various speech conversion based event recording method embodiments described above, such as the steps shown in fig. 1.
Illustratively, the computer program may be partitioned into one or more modules/units that are stored in the memory 12 and executed by the processor 13 to complete the present invention. The one or more modules/units may be a series of computer readable instruction segments capable of performing the specified functions, which instruction segments describe the execution of the computer program in the computer device 1. For example, the computer program may be divided into a start-up unit 110, a verification unit 111, an acquisition unit 112, an optimization unit 113, a splitting unit 114, an input unit 115, an insertion unit 116.
The integrated units implemented in the form of software functional modules described above may be stored in a computer readable storage medium. The software functional module is stored in a storage medium, and includes instructions for causing a computer device (which may be a personal computer, a computer device, or a network device, etc.) or a processor (processor) to execute portions of the speech conversion based event recording method according to the embodiments of the present invention.
The modules/units integrated in the computer device 1 may be stored in a computer readable storage medium if implemented in the form of software functional units and sold or used as separate products. Based on this understanding, the present invention may also be implemented by a computer program for instructing a relevant hardware device to implement all or part of the procedures of the above-mentioned embodiment method, where the computer program may be stored in a computer readable storage medium and the computer program may be executed by a processor to implement the steps of each of the above-mentioned method embodiments.
The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created from the use of blockchain nodes, and the like.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, encryption algorithm and the like. The blockchain (Blockchain), essentially a de-centralized database, is a string of data blocks that are generated in association using cryptographic methods, each of which contains information from a batch of network transactions for verifying the validity (anti-counterfeit) of its information and generating the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one line is shown in fig. 3, but this does not mean that there is only one bus or only one type of bus. The bus is arranged to enable communication between the memory 12, the at least one processor 13, and other components.
Although not shown, the computer device 1 may further comprise a power source (such as a battery) for powering the various components, preferably the power source may be logically connected to the at least one processor 13 via a power management means, whereby the functions of charge management, discharge management, and power consumption management are achieved by the power management means. The power supply may also include one or more of any of a direct current or alternating current power supply, recharging device, power failure detection circuit, power converter or inverter, power status indicator, etc. The computer device 1 may further include various sensors, bluetooth modules, wi-Fi modules, etc., which will not be described in detail herein.
Further, the computer device 1 may also comprise a network interface, which optionally comprises a wired interface and/or a wireless interface (such as a Wi-Fi interface or a Bluetooth interface), typically used to establish a communication connection between the computer device 1 and other computer devices.
The computer device 1 may optionally further comprise a user interface, which may be a display, an input unit such as a keyboard, a standard wired interface, or a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like. The display may also be referred to as a display screen or display unit, and is used to display the information processed in the computer device 1 and to display a visual user interface.
It should be understood that the described embodiments are for illustrative purposes only and that the scope of the patent application is not limited to this configuration.
Fig. 3 shows only a computer device 1 with components 12-13, it being understood by those skilled in the art that the structure shown in fig. 3 is not limiting of the computer device 1 and may include fewer or more components than shown, or may combine certain components, or a different arrangement of components.
In connection with fig. 1, the memory 12 in the computer device 1 stores a plurality of instructions for implementing the event recording method based on voice conversion, and the processor 13 can execute the plurality of instructions to implement:
Responding to a recording instruction of a target event, acquiring an event type of the target event, and starting an event recording interface corresponding to the event type;
when detecting that data is input in a designated input box of the event recording interface, checking the input data;
When the input data passes the verification, the voice of the user is collected in real time;
optimizing the user voice based on the frequency band signal intensity to obtain data to be processed;
splitting the data to be processed according to a configuration splitting strategy to obtain data to be converted;
Inputting the data to be converted into a pre-trained multilingual conversion model to obtain a target text;
and inserting the target text into a designated area of the event recording interface.
Specifically, the specific implementation method of the above instructions by the processor 13 may refer to the description of the relevant steps in the corresponding embodiment of fig. 1, which is not repeated herein.
The data in this case were obtained legally.
In the several embodiments provided in the present invention, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be other manners of division when actually implemented.
The invention is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional module in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated unit can be implemented in the form of hardware, or in the form of hardware plus software functional modules.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof.
The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. The units or means stated in the invention may also be implemented by one unit or means, either by software or hardware. The terms first, second, etc. are used to denote a name, but not any particular order.
Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention.

Claims (9)

1. A voice conversion-based event recording method, comprising:
Responding to a recording instruction of a target event, acquiring an event type of the target event, and starting an event recording interface corresponding to the event type;
when detecting that data is input in a designated input box of the event recording interface, checking the input data;
When the input data passes the verification, the voice of the user is collected in real time;
optimizing the user voice based on the frequency band signal intensity to obtain data to be processed;
Splitting the data to be processed according to a configuration splitting strategy to obtain data to be converted, wherein the splitting comprises the following steps: acquiring the user timbres corresponding to the data to be processed, and performing primary splitting of the data to be processed according to the user timbres to obtain a data segment corresponding to each user timbre; acquiring a pause time threshold, and performing secondary splitting of the data segment corresponding to each user timbre according to the pause time threshold to obtain first sub-data segments; acquiring a pre-established dictionary, and fusing the first sub-data segments by using the dictionary to obtain a plurality of second sub-data segments; acquiring the start time and the end time of each second sub-data segment; and marking each second sub-data segment according to its start time and end time to obtain the data to be converted;
Inputting the data to be converted into a pre-trained multilingual conversion model to obtain a target text;
and inserting the target text into a designated area of the event recording interface.
2. The voice conversion based event recording method as claimed in claim 1, wherein optimizing the user voice based on the frequency band signal strength to obtain the data to be processed comprises:
converting the user voice into a digital signal to obtain a first signal;
identifying a high-band signal from the first signal;
boosting the frequency spectrum of the high-frequency band signal to obtain a second signal;
acquiring a preset threshold value, and denoising the second signal based on the preset threshold value to obtain the data to be processed.
3. The speech conversion based event recording method according to claim 1, wherein the inputting the data to be converted into a pre-trained multilingual conversion model to obtain a target text comprises:
Sequentially inputting each second sub-data segment in the data to be converted into the multi-language conversion model according to the marked starting time to obtain a text corresponding to each second sub-data segment;
sequentially combining texts corresponding to each second sub-data segment according to the starting time and the ending time of the mark to obtain the target text;
The multi-language conversion model is obtained by training a bidirectional long short-term memory neural network based on a plurality of language samples.
4. The voice conversion based event recording method of claim 1, wherein the method further comprises:
responding to a synchronous recording instruction, and detecting the user type;
when the user type is a preset type, sending out prompt information asking whether to turn on the accompanying camera;
when a confirmation signal fed back based on the prompt information is received, determining the accompanying room in which the user's accompanying person is located;
turning on the accompanying camera of the accompanying room;
acquiring, in real time, the video captured by the camera of the area where the user is located as an initial video;
inserting the pictures captured by the accompanying camera in real time into the initial video in a picture-in-picture mode to obtain a recorded video;
and uploading the recorded video as an attachment to the designated position of the event recording interface.
5. The voice conversion based event recording method of claim 1, wherein the method further comprises:
determining an evidence presentation type in response to a video evidence presentation instruction;
when the evidence presentation type is ordinary presentation, acquiring the selected camera as a target camera, and displaying, in real time, the picture captured by the target camera; or
when the evidence presentation type is article presentation, acquiring the selected article as a target article, displaying an image of the target article, and projecting a real-time video of the target article onto a designated display.
6. The voice conversion based event recording method according to claim 1, wherein after the target text is inserted into a designated area of the event recording interface, the method further comprises:
Generating an event record file according to the event record interface;
when the event record file is of a remote inquiry type, acquiring a first real-time video of the local inquiry room and a second real-time video of the remote inquiry room, and simultaneously displaying the first real-time video, the second real-time video and the target text on the display of the local inquiry room and the display of the remote inquiry room; or
when the event record file is of a specified type, marking the event record file and displaying a tag-adding prompt in the form of a pop-up box; when an analysis instruction for any tag is received, displaying the detailed information of that tag; when an update instruction for any tag is received, performing a delete operation or a modify operation on that tag according to the update instruction; and when any tag is of the instant messaging type and a query instruction for that tag is received, displaying the instant messaging message record.
7. A speech conversion based event recording apparatus, the speech conversion based event recording apparatus comprising:
The starting unit is used for responding to a recording instruction of a target event, acquiring the event type of the target event and starting an event recording interface corresponding to the event type;
The verification unit is used for verifying the input data when it is detected that data is input in the designated input box of the event recording interface;
the acquisition unit is used for acquiring user voice in real time when the input data passes the verification;
The optimizing unit is used for optimizing the user voice based on the frequency band signal intensity to obtain data to be processed;
The splitting unit is configured to split the data to be processed according to a configuration splitting policy to obtain data to be converted, which includes: acquiring the user timbres corresponding to the data to be processed, and performing primary splitting of the data to be processed according to the user timbres to obtain a data segment corresponding to each user timbre; acquiring a pause time threshold, and performing secondary splitting of the data segment corresponding to each user timbre according to the pause time threshold to obtain first sub-data segments; acquiring a pre-established dictionary, and fusing the first sub-data segments by using the dictionary to obtain a plurality of second sub-data segments; acquiring the start time and the end time of each second sub-data segment; and marking each second sub-data segment according to its start time and end time to obtain the data to be converted;
the input unit is used for inputting the data to be converted into a multi-language conversion model trained in advance to obtain a target text;
And the inserting unit is used for inserting the target text into the designated area of the event recording interface.
8. A computer device, the computer device comprising:
a memory storing at least one instruction; and
A processor executing instructions stored in the memory to implement the speech conversion based event recording method of any one of claims 1 to 6.
9. A computer-readable storage medium, characterized by: the computer-readable storage medium having stored therein at least one instruction for execution by a processor in a computer device to implement the speech conversion based event recording method of any of claims 1 to 6.
CN202410366053.9A 2024-03-28 2024-03-28 Event recording method, device, equipment and medium based on voice conversion Active CN117975949B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410366053.9A CN117975949B (en) 2024-03-28 2024-03-28 Event recording method, device, equipment and medium based on voice conversion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410366053.9A CN117975949B (en) 2024-03-28 2024-03-28 Event recording method, device, equipment and medium based on voice conversion

Publications (2)

Publication Number Publication Date
CN117975949A CN117975949A (en) 2024-05-03
CN117975949B (en) 2024-06-07

Family

ID=90846317

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410366053.9A Active CN117975949B (en) 2024-03-28 2024-03-28 Event recording method, device, equipment and medium based on voice conversion

Country Status (1)

Country Link
CN (1) CN117975949B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111355912A (en) * 2020-02-17 2020-06-30 江苏济楚信息技术有限公司 Law enforcement recording method and system
CN111916074A (en) * 2020-06-29 2020-11-10 厦门快商通科技股份有限公司 Cross-device voice control method, system, terminal and storage medium
CN113035199A (en) * 2021-02-01 2021-06-25 深圳创维-Rgb电子有限公司 Audio processing method, device, equipment and readable storage medium
CN113053390A (en) * 2021-03-22 2021-06-29 北京儒博科技有限公司 Text processing method and device based on voice recognition, electronic equipment and medium
CN113314146A (en) * 2020-02-27 2021-08-27 安讯士有限公司 Method, software and apparatus for training an alert system to audio classification of an event
CN114783423A (en) * 2022-05-18 2022-07-22 平安科技(深圳)有限公司 Speech segmentation method and device based on speech rate adjustment, computer equipment and medium
CN115881108A (en) * 2022-09-02 2023-03-31 北京中关村科金技术有限公司 Voice recognition method, device, equipment and storage medium
WO2023083142A1 (en) * 2021-11-10 2023-05-19 北京有竹居网络技术有限公司 Sentence segmentation method and apparatus, storage medium, and electronic device
CN116320614A (en) * 2022-12-02 2023-06-23 深圳匠人网络科技有限公司 Method, system and medium for intelligently displaying voice to text
CN117151047A (en) * 2023-08-30 2023-12-01 广东保伦电子股份有限公司 Conference summary generation method based on AI identification
CN117407507A (en) * 2023-10-27 2024-01-16 北京百度网讯科技有限公司 Event processing method, device, equipment and medium based on large language model
CN117894312A (en) * 2023-12-29 2024-04-16 科大讯飞股份有限公司 Speech recognition method, training method of speech recognition model and related device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2648617C (en) * 2006-04-05 2017-12-12 Yap, Inc. Hosted voice recognition system for wireless devices
KR20210044475A (en) * 2019-10-15 2021-04-23 엘지전자 주식회사 Apparatus and method for determining object indicated by pronoun
CN112752047A (en) * 2019-10-30 2021-05-04 北京小米移动软件有限公司 Video recording method, device, equipment and readable storage medium
US11538481B2 (en) * 2020-03-18 2022-12-27 Sas Institute Inc. Speech segmentation based on combination of pause detection and speaker diarization


Also Published As

Publication number Publication date
CN117975949A (en) 2024-05-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant