CN116347182A - Information processing method, intelligent terminal and storage medium

Info

Publication number: CN116347182A
Application number: CN202310450817.8A
Authority: CN
Inventors: 徐鑫
Original assignee: Shanghai Chuanying Information Technology Co Ltd
Current assignee: Shanghai Chuanying Information Technology Co Ltd
Priority date: 2023-04-24
Filing date: 2023-04-24
Publication date: 2023-06-27

Abstract

The application provides an information processing method, an intelligent terminal, and a storage medium. The information processing method comprises the following steps: S10: acquiring audio stream data and converting the audio stream data into target text data; S20: acquiring or determining a subtitle scene associated with the audio stream data; S30: displaying subtitle content corresponding to the target text data in the current interface according to the subtitle scene. In the application, the audio stream data is converted into the target text data offline, and the display mode of the subtitles can be adjusted for different subtitle scenes, which can bring users a richer subtitle experience.

Description

Information processing method, intelligent terminal and storage medium

Technical Field

The application relates to the technical field of intelligent terminals, in particular to an information processing method, an intelligent terminal and a storage medium.

Background

Subtitles present non-visual content, such as the dialogue in television, films, and stage works, in text form; the term also covers text added during the post-production of film and television works. When the speech in a film or television work is not clear enough, or when there is a language barrier, subtitles help the viewer understand the content. Subtitles usually have to be produced in advance, and as the demand for subtitles keeps growing, AI (Artificial Intelligence) subtitle technology is gradually emerging.

In the course of conceiving and implementing the present application, the inventors found at least the following problems: 1. the subtitle display style is monotonous; 2. once the AI subtitle function is enabled, network support is required, so in an environment with no network or a weak network the subtitle function cannot be used or the subtitle text has low accuracy, which gives the user a poor experience.

The foregoing description is provided for general background information and does not necessarily constitute prior art.

Disclosure of Invention

In view of the above technical problems, the application provides an information processing method, an intelligent terminal, and a storage medium, aiming to solve the problem of a monotonous subtitle display style.

In order to solve the above technical problems, the present application provides an information processing method, which includes:

S10: acquiring audio stream data, and converting the audio stream data into target text data;

S20: acquiring or determining a subtitle scene associated with the audio stream data;

S30: displaying subtitle content corresponding to the target text data in the current interface according to the subtitle scene.

Optionally, the step S10 includes:

acquiring or determining a sound source, and acquiring audio stream data of the sound source;

identifying or acquiring the original language of the audio stream data, and converting the audio stream data into target text data corresponding to the original language.

Optionally, the step S10 includes:

converting the audio stream data into target text data corresponding to a target language.

Optionally, the step S20 includes:

identifying an application associated with the audio stream data;

determining the subtitle scene according to the attribute information of the application program.

Optionally, after the step S20, the method further includes:

acquiring or determining a preset word stock according to the subtitle scene;

performing error correction processing on the target text data based on the preset word stock.
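The scene-specific correction step can be illustrated with a short sketch. The application does not specify the correction algorithm, so this example swaps in plain string-similarity matching from Python's standard `difflib` module; the scene labels, the word stock contents, and the `correct_text` helper are all hypothetical.

```python
import difflib

# Hypothetical preset word stocks, keyed by subtitle scene.
WORD_STOCKS = {
    "music": ["chorus", "melody", "rhythm"],
    "talk": ["meeting", "schedule", "tomorrow"],
}


def correct_text(text: str, scene: str, cutoff: float = 0.8) -> str:
    """Replace each word with its closest word-stock entry, if one is similar enough."""
    stock = WORD_STOCKS.get(scene, [])
    corrected = []
    for word in text.split():
        # get_close_matches returns at most n entries whose similarity is >= cutoff.
        matches = difflib.get_close_matches(word.lower(), stock, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)


fixed = correct_text("the melodi and rythm", "music")
```

With the word stock above, the misrecognized words are mapped to "melody" and "rhythm", while words with no close entry are left unchanged. A scene with no word stock leaves the text as it was.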

Optionally, the step S10 further includes:

in response to acquiring or confirming an offline state, invoking an offline subtitle model to convert the audio stream data into target text data.

Optionally, the step S30 includes:

acquiring preset display information according to the subtitle scene and/or the play identification information of the audio stream data;

displaying the subtitle content corresponding to the target text data in the current interface according to the preset display information.

Optionally, the preset display information includes at least one of a subtitle display position, a font of a subtitle, a color of the subtitle, and a subtitle hover frame.
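One way to picture the preset display information is as a small record looked up by subtitle scene. The sketch below is only an assumption about how such presets might be stored: the field names, the default values, and the scene labels are hypothetical, and the lookup uses only the subtitle scene, not the play identification information.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class DisplayInfo:
    position: Tuple[int, int] = (0, 0)  # subtitle display position (x, y)
    font: str = "sans-serif"            # font of the subtitle
    color: str = "#FFFFFF"              # color of the subtitle
    hover_frame: bool = True            # whether a subtitle hover frame is shown


DEFAULT_DISPLAY = DisplayInfo()

# Hypothetical presets for two subtitle scenes.
PRESETS: Dict[str, DisplayInfo] = {
    "music": DisplayInfo(position=(0, 800), font="serif", color="#FFD700"),
    "talk": DisplayInfo(position=(0, 200), color="#00FF00"),
}


def display_info_for(scene: str) -> DisplayInfo:
    """Return the preset for the scene, falling back to the default preset."""
    return PRESETS.get(scene, DEFAULT_DISPLAY)
```

Because each field has a default, a preset only needs to state the attributes that differ for its scene.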

The application also provides an intelligent terminal, including a memory and a processor, wherein the memory stores an information processing program which, when executed by the processor, implements the steps of the method described above.

The present application also provides a storage medium storing a computer program which, when executed by a processor, implements the steps of the method as described above.

As described above, the information processing method of the present application includes the steps of: S10: acquiring audio stream data, and converting the audio stream data into target text data; S20: acquiring or determining a subtitle scene associated with the audio stream data; S30: displaying subtitle content corresponding to the target text data in the current interface according to the subtitle scene. With this technical solution, the display mode of the subtitles can be adjusted for different subtitle scenes, which can bring the user a richer subtitle experience.
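The three steps can be sketched as a small pipeline. This is a minimal illustration rather than the application's implementation: the function names, the `Subtitle` record, and the stand-in recognizer and scene detector are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Subtitle:
    text: str   # target text data
    scene: str  # subtitle scene


def process_audio(
    audio: bytes,
    convert_to_text: Callable[[bytes], str],
    determine_scene: Callable[[bytes], str],
) -> Subtitle:
    """Run S10 and S20 and return what S30 needs in order to display."""
    text = convert_to_text(audio)   # S10: audio stream data -> target text data
    scene = determine_scene(audio)  # S20: subtitle scene associated with the audio
    return Subtitle(text=text, scene=scene)


def render_subtitle(subtitle: Subtitle) -> str:
    """S30 stand-in: format the subtitle content according to its scene."""
    return f"[{subtitle.scene}] {subtitle.text}"


# Hypothetical stand-ins for a real recognizer and scene detector.
def fake_recognizer(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_scene_detector(audio: bytes) -> str:
    return "music"


shown = render_subtitle(process_audio(b"hello world", fake_recognizer, fake_scene_detector))
```

S10 and S20 do not depend on each other's results in this sketch, so they could run in either order.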

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application. In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.

Fig. 1 is a schematic hardware structure diagram of an intelligent terminal implementing various embodiments of the present application;

Fig. 2 is a schematic diagram of a communication network system according to an embodiment of the present application;

Fig. 3 is a flow chart of an information processing method according to the first embodiment;

Fig. 4 is a schematic diagram of a subtitle display interface of the information processing method according to the first embodiment;

Fig. 5 is a schematic diagram of subtitle display in a music scene of the information processing method according to the first embodiment;

Fig. 6 is a schematic diagram of subtitle display in a talk scene of the information processing method according to the first embodiment;

Fig. 7 is a flow chart of an information processing method according to the second embodiment.

The realization, functional characteristics and advantages of the present application will be further described with reference to the embodiments, referring to the attached drawings. Specific embodiments thereof have been shown by way of example in the drawings and will herein be described in more detail. These drawings and the written description are not intended to limit the scope of the inventive concepts in any way, but to illustrate the concepts of the present application to those skilled in the art by reference to specific embodiments.

Detailed Description

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element. Furthermore, elements having the same name in different embodiments of the present application may have the same meaning or different meanings; the particular meaning is determined by its interpretation in the specific embodiment or by further considering the context of that embodiment.

It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information and, similarly, second information may also be referred to as first information, without departing from the scope herein. The word "if" as used herein may be interpreted as "when," "upon," or "in response to a determination," depending on the context. Furthermore, as used herein, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context indicates otherwise. It will be further understood that the terms "comprises," "comprising," "includes," and/or "including" specify the presence of stated features, steps, operations, elements, components, items, categories, and/or groups, but do not preclude the presence, occurrence, or addition of one or more other features, steps, operations, elements, components, items, categories, and/or groups. The terms "or," "and/or," "including at least one of," and the like, as used herein, may be construed as inclusive, meaning any one or any combination. For example, "including at least one of: A, B, C" means "any of the following: A; B; C; A and B; A and C; B and C; A and B and C"; as a further example, "A, B or C" or "A, B and/or C" means "any of the following: A; B; C; A and B; A and C; B and C; A and B and C". An exception to this definition occurs only when a combination of elements, functions, steps, or operations is in some way inherently mutually exclusive.

It should be understood that, although the steps in the flowcharts of the embodiments of the present application are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the order of the steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in the figures may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.

The word "if," as used herein, may be interpreted as "when," "upon," "in response to a determination," or "in response to a detection," depending on the context. Similarly, the phrase "if determined" or "if (a stated condition or event) is detected" may be interpreted as "when determined," "in response to a determination," "when (the stated condition or event) is detected," or "in response to detection of (the stated condition or event)," depending on the context.

It should be noted that step numbers such as S10 and S20 are used in this document to describe the corresponding content more clearly and concisely, and do not constitute a substantive limitation on the order of the steps. In a specific implementation, those skilled in the art may execute S20 first and then S10, and this remains within the scope of protection of the present application.

It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.

In the following description, suffixes such as "module," "component," or "unit" used to denote elements serve only to facilitate the description of the present application and have no specific meaning in themselves. Thus, "module," "component," and "unit" may be used interchangeably.

The intelligent terminal may be implemented in various forms. For example, the smart terminals described in the present application may include smart terminals such as cell phones, tablet computers, notebook computers, palm computers, personal digital assistants (Personal Digital Assistant, PDA), portable media players (Portable Media Player, PMP), navigation devices, wearable devices, smart bracelets, pedometers, and stationary terminals such as digital TVs, desktop computers, and the like.

The following description takes a mobile terminal as an example. Those skilled in the art will understand that, except for elements used specifically for mobile purposes, the configuration according to the embodiments of the present application can also be applied to stationary terminals.

Referring to fig. 1, which is a schematic hardware structure of a mobile terminal implementing various embodiments of the present application, the mobile terminal 100 may include: an RF (Radio Frequency) unit 101, a WiFi module 102, an audio output unit 103, an A/V (audio/video) input unit 104, a sensor 105, a display unit 106, a user input unit 107, an interface unit 108, a memory 109, a processor 110, and a power supply 111. Those skilled in the art will appreciate that the mobile terminal structure shown in fig. 1 does not limit the mobile terminal, and that the mobile terminal may include more or fewer components than shown, combine certain components, or use a different arrangement of components.

The following describes the components of the mobile terminal in detail with reference to fig. 1:

The radio frequency unit 101 may be used to receive and transmit signals while information is being sent and received or during a call. Specifically, it receives downlink information from the base station and passes it to the processor 110 for processing, and it transmits uplink data to the base station. Typically, the radio frequency unit 101 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like. In addition, the radio frequency unit 101 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to GSM (Global System for Mobile Communications), GPRS (General Packet Radio Service), CDMA2000 (Code Division Multiple Access 2000), WCDMA (Wideband Code Division Multiple Access), TD-SCDMA (Time Division-Synchronous Code Division Multiple Access), FDD-LTE (Frequency Division Duplexing-Long Term Evolution), TDD-LTE (Time Division Duplexing-Long Term Evolution), and 5G.

WiFi is a short-range wireless transmission technology. Through the WiFi module 102, the mobile terminal can help a user send and receive e-mail, browse web pages, access streaming media, and so on, providing the user with wireless broadband Internet access. Although fig. 1 shows the WiFi module 102, it is understood that it is not an essential part of the mobile terminal and can be omitted as required without changing the essence of the invention.

The audio output unit 103 may convert audio data received by the radio frequency unit 101 or the WiFi module 102 or stored in the memory 109 into an audio signal and output as sound when the mobile terminal 100 is in a call signal reception mode, a talk mode, a recording mode, a voice recognition mode, a broadcast reception mode, or the like. Also, the audio output unit 103 may also provide audio output (e.g., a call signal reception sound, a message reception sound, etc.) related to a specific function performed by the mobile terminal 100. The audio output unit 103 may include a speaker, a buzzer, and the like.

The A/V input unit 104 is used to receive an audio or video signal. The A/V input unit 104 may include a graphics processor (Graphics Processing Unit, GPU) 1041 and a microphone 1042. The graphics processor 1041 processes image data of still pictures or video obtained by an image capturing device (e.g., a camera) in a video capturing mode or an image capturing mode. The processed image frames may be displayed on the display unit 106. The image frames processed by the graphics processor 1041 may be stored in the memory 109 (or other storage medium) or transmitted via the radio frequency unit 101 or the WiFi module 102. The microphone 1042 can receive sound (audio data) in a phone call mode, a recording mode, a voice recognition mode, and the like, and can process such sound into audio data. In the phone call mode, the processed audio (voice) data may be converted into a format that can be transmitted to a mobile communication base station via the radio frequency unit 101 and then output. The microphone 1042 may implement various types of noise cancellation (or suppression) algorithms to cancel (or suppress) noise or interference generated in the course of receiving and transmitting audio signals.

The mobile terminal 100 also includes at least one sensor 105, such as a light sensor, a motion sensor, and other sensors. Optionally, the light sensor includes an ambient light sensor and a proximity sensor. The ambient light sensor may adjust the brightness of the display panel 1061 according to the brightness of the ambient light, and the proximity sensor may turn off the display panel 1061 and/or the backlight when the mobile terminal 100 is moved to the ear. As one kind of motion sensor, the accelerometer sensor can detect the magnitude of acceleration in all directions (generally three axes) and can detect the magnitude and direction of gravity when stationary; it can be used for applications that recognize the attitude of the mobile phone (such as switching between landscape and portrait, related games, and magnetometer attitude calibration), for vibration-recognition functions (such as a pedometer and tap detection), and the like. Other sensors that may also be configured in the mobile phone, such as fingerprint sensors, pressure sensors, iris sensors, molecular sensors, gyroscopes, barometers, hygrometers, thermometers, and infrared sensors, are not described in detail here.

The display unit 106 is used to display information input by a user or information provided to the user. The display unit 106 may include a display panel 1061, and the display panel 1061 may be configured in the form of a liquid crystal display (Liquid Crystal Display, LCD), an Organic Light-Emitting Diode (OLED), or the like.

The user input unit 107 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the mobile terminal. Optionally, the user input unit 107 may include a touch panel 1071 and other input devices 1072. The touch panel 1071, also referred to as a touch screen, may collect touch operations by a user on or near it (e.g., operations performed on or near the touch panel 1071 with any suitable object or accessory such as a finger or a stylus) and drive the corresponding connection device according to a predetermined program. The touch panel 1071 may include two parts: a touch detection device and a touch controller. Optionally, the touch detection device detects the position of the user's touch, detects the signal produced by the touch operation, and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection device, converts it into touch point coordinates, sends the coordinates to the processor 110, and can receive and execute commands sent by the processor 110. Further, the touch panel 1071 may be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave. In addition to the touch panel 1071, the user input unit 107 may include other input devices 1072. Optionally, the other input devices 1072 may include, but are not limited to, one or more of a physical keyboard, function keys (e.g., volume control keys, switch keys), a trackball, a mouse, and a joystick, which are not specifically limited herein.

Optionally, the touch panel 1071 may overlay the display panel 1061. When the touch panel 1071 detects a touch operation on or near it, the touch operation is passed to the processor 110 to determine the type of touch event, and the processor 110 then provides a corresponding visual output on the display panel 1061 according to the type of touch event. Although in fig. 1 the touch panel 1071 and the display panel 1061 are two independent components implementing the input and output functions of the mobile terminal, in some embodiments the touch panel 1071 may be integrated with the display panel 1061 to implement the input and output functions of the mobile terminal, which is not limited herein.

The interface unit 108 serves as an interface through which at least one external device can be connected with the mobile terminal 100. For example, the external devices may include a wired or wireless headset port, an external power (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device having an identification module, an audio input/output (I/O) port, a video I/O port, an earphone port, and the like. The interface unit 108 may be used to receive input (e.g., data information, power, etc.) from an external device and transmit the received input to one or more elements within the mobile terminal 100 or may be used to transmit data between the mobile terminal 100 and an external device.

Memory 109 may be used to store software programs as well as various data. The memory 109 may mainly include a storage program area and a storage data area, and alternatively, the storage program area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, phonebook, etc.) created according to the use of the handset, etc. In addition, memory 109 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.

The processor 110 is a control center of the mobile terminal, connects various parts of the entire mobile terminal using various interfaces and lines, and performs various functions of the mobile terminal and processes data by running or executing software programs and/or modules stored in the memory 109 and calling data stored in the memory 109, thereby performing overall monitoring of the mobile terminal. Processor 110 may include one or more processing units; preferably, the processor 110 may integrate an application processor and a modem processor, the application processor optionally handling mainly an operating system, a user interface, an application program, etc., the modem processor handling mainly wireless communication. It will be appreciated that the modem processor described above may not be integrated into the processor 110.

The mobile terminal 100 may further include a power source 111 (e.g., a battery) for supplying power to the respective components, and preferably, the power source 111 may be logically connected to the processor 110 through a power management system, so as to perform functions of managing charging, discharging, and power consumption management through the power management system.

Although not shown in fig. 1, the mobile terminal 100 may further include a bluetooth module or the like, which is not described herein.

In order to facilitate understanding of the embodiments of the present application, a communication network system on which the mobile terminal of the present application is based will be described below.

Referring to fig. 2, fig. 2 is a schematic diagram of a communication network system provided in the embodiment of the present application. The communication network system is an LTE system of universal mobile telecommunications technology, and the LTE system includes a UE (User Equipment) 201, an E-UTRAN (Evolved UMTS Terrestrial Radio Access Network) 202, an EPC (Evolved Packet Core) 203, and an operator's IP service 204, which are communicatively connected in sequence.

Alternatively, the UE201 may be the terminal 100 described above, which is not described here again.

The E-UTRAN202 includes eNodeB2021, other eNodeBs 2022, and so on. Optionally, the eNodeB2021 may be connected to the other eNodeBs 2022 over a backhaul (e.g., an X2 interface), the eNodeB2021 is connected to the EPC203, and the eNodeB2021 may provide the UE201 with access to the EPC203.

EPC203 may include an MME (Mobility Management Entity) 2031, an HSS (Home Subscriber Server) 2032, other MMEs 2033, an SGW (Serving Gateway) 2034, a PGW (PDN Gateway) 2035, a PCRF (Policy and Charging Rules Function) 2036, and so on. Optionally, MME2031 is a control node that handles signaling between UE201 and EPC203, providing bearer and connection management. HSS2032 is used to provide registers to manage functions such as a home location register (not shown) and to hold user-specific information about service characteristics, data rates, and the like. All user data may be sent through SGW2034. PGW2035 may provide IP address allocation and other functions for UE201. PCRF2036 is the policy and charging control decision point for service data flows and IP bearer resources; it selects and provides available policy and charging control decisions for a policy and charging enforcement function (not shown).

IP services 204 may include the Internet, intranets, IMS (IP Multimedia Subsystem), or other IP services, etc.

Although the LTE system is described above as an example, it should be understood by those skilled in the art that the present application is not limited to LTE systems, but may be applied to other wireless communication systems, such as GSM, CDMA2000, WCDMA, TD-SCDMA, and future new network systems (e.g., 5G), etc.

Based on the above-mentioned mobile terminal hardware structure and communication network system, various embodiments of the present application are presented.

First embodiment

The embodiment of the application provides an information processing method, as shown in fig. 3, the method includes:

Step S10: acquiring audio stream data, and converting the audio stream data into target text data;

The execution body of this embodiment may be an intelligent terminal. The intelligent terminal may include a subtitle function. After the user chooses to enable the subtitle function, the intelligent terminal can monitor whether audio stream data is detected and, once audio stream data is detected, convert it into target text data offline. This embodiment does not limit the manner of acquiring or determining the audio stream data; for example, the audio stream data may be generated by the intelligent terminal itself or input to the intelligent terminal from outside.

Optionally, in some possible embodiments, step S10 may include:

Step a, acquiring or determining a sound source, and acquiring audio stream data of the sound source;

The sound source may be understood as the source from which the audio stream data is output, and the user may select the sound source of the audio stream data in the setting interface of the subtitle function. The sound source may be an internal source or an external source. An internal source, such as media sound, may be audio or video played by the intelligent terminal, a ringtone, a notification tone, or game audio. An external source, such as a microphone, can pick up external audio within its sound reception range, so that the intelligent terminal can receive externally input audio stream data.

After the user chooses to enable the subtitle function, the language type and the sound source type of the voice assistant can be selected in the interface of the intelligent terminal. The intelligent terminal reads the sound source setting information of the subtitle function to determine the current sound source type, and then collects the audio stream data of the corresponding sound source. The audio stream data may be collected by recording audio through the recording function of the intelligent terminal.
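The selection of a sound source from the subtitle settings can be sketched as follows. The setting key `sound_source`, the enum values, and the recorder callables are hypothetical placeholders for whatever recording interface the terminal actually exposes.

```python
from enum import Enum
from typing import Callable, Dict


class SoundSource(Enum):
    INTERNAL = "media"       # audio the terminal itself plays
    EXTERNAL = "microphone"  # audio picked up from outside the terminal


def acquire_audio(
    settings: Dict[str, str],
    recorders: Dict[SoundSource, Callable[[], bytes]],
) -> bytes:
    """Read the configured sound source type and record from that source."""
    source = SoundSource(settings["sound_source"])  # raises ValueError if unknown
    return recorders[source]()


# Hypothetical recorders that stand in for the terminal's recording function.
recorders = {
    SoundSource.INTERNAL: lambda: b"internal-audio",
    SoundSource.EXTERNAL: lambda: b"external-audio",
}

audio = acquire_audio({"sound_source": "microphone"}, recorders)
```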

Step b, identifying or acquiring the original language of the audio stream data, and converting the audio stream data into target text data corresponding to the original language.

Optionally, in response to the intelligent terminal currently being in an offline state, an offline subtitle model is invoked to convert the audio stream data into target text data.

The intelligent terminal can store the offline subtitle model in advance, pass the collected audio stream data to the offline subtitle model, and obtain a text result as output, namely the target text data. The offline subtitle model may be an ASR (Automatic Speech Recognition) model, which can encode and decode the input audio stream data and output the target text data in text form. The ASR model may include preset language models, such as an English model and a French model, to meet the subtitle generation needs of audio stream data from different regions and in different languages. The language model can follow the system language setting of the intelligent terminal, or can be set by the user in the setting interface of the subtitle function. When the ASR model analyzes the audio stream data, it can first extract features from the audio stream data, match the extracted feature vectors against the data in the ASR model, and output the text content; the language of the text can be consistent with the original language of the audio stream data.
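The choice of language model can be sketched as a simple lookup. The text says only that the model can follow the system language or be set by the user; giving the user setting precedence is an assumption of this sketch, and the model names and the `select_language_model` helper are hypothetical.

```python
from typing import Optional

# Hypothetical language models bundled with the offline subtitle model.
AVAILABLE_MODELS = {"en": "english-model", "fr": "french-model"}


def select_language_model(system_language: str,
                          user_language: Optional[str] = None) -> str:
    """Use the language set in the subtitle settings, else follow the system language."""
    language = user_language or system_language
    if language not in AVAILABLE_MODELS:
        raise ValueError(f"no offline model for language {language!r}")
    return AVAILABLE_MODELS[language]


model = select_language_model(system_language="en", user_language="fr")
```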

Optionally, in some possible embodiments, step S10 may further include:

step c, acquiring or determining a sound source, and acquiring audio stream data of the sound source;

step d, converting the audio stream data into target text data corresponding to the target language.

The subtitle function can also provide a subtitle translation service to help users understand audio and video content when listening to audio in a language they are not proficient in. In the setting interface of the subtitle function, the user can set the language conversion type of the subtitle through the language setting information. The language setting information may include the initial language type before conversion and the target language after conversion.

A translation service may be preset in the system of the intelligent terminal. The translation service converts the acquired audio stream data from the initial language type into the target language and outputs target text data in the target language. The translation service may be integrated into the offline subtitle model, so that the text content corresponding to the target text data can be output in an offline state and displayed in a hover frame.

step S20, acquiring or determining a subtitle scene associated with the audio stream data;

The subtitle scene may be regarded as the application scene of the subtitle, and the display effect of the subtitle may differ between application scenes. Different types of audio stream data may be associated with different application scenes. This embodiment does not limit the manner of acquiring or determining the subtitle scene; for example, the subtitle scene may be selected according to the sound source type of the audio stream data.

In some possible embodiments, step S20 may include:

step e, identifying or acquiring an application program associated with the audio stream data;

Various types of application programs may be installed in the intelligent terminal, and the generation of audio stream data is associated with an application program. For example, when a user watches a video using a video application on the intelligent terminal, the video application outputs audio to the sound source, generating audio stream data. Subtitle scenes may therefore be classified by the application program associated with the audio stream data. To identify that application program, the occupation of the sound source can be analyzed to obtain the name of the application program occupying the sound source, and the attribute information of the application program is then obtained according to that name.

step f, determining the subtitle scene according to the attribute information of the application program.

The attribute information may include application type information and application identification information. The application type information may represent a usage type of the application program, such as a music type and a video type. The application identification information may represent a type of function of the application program in the intelligent terminal, such as a call and a recording. The audio stream data may be divided into different subtitle scenes such as a music scene, a video scene, and a call scene by attribute information.
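Steps e and f can be sketched as a small mapping from application attribute information to a subtitle scene. The class name, field names, scene labels, and the rule that a call identifier takes priority over the usage type are hypothetical; the application does not prescribe a particular data structure or priority.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AppAttributes:
    # Application type information, e.g. "music" or "video".
    app_type: Optional[str] = None
    # Application identification information, e.g. "call" or "recording".
    app_identifier: Optional[str] = None


def determine_subtitle_scene(attrs: AppAttributes) -> str:
    """Map attribute information to a subtitle scene label."""
    # Assumption: a call identifier overrides the usage type.
    if attrs.app_identifier == "call":
        return "call_scene"
    scene_by_type = {"music": "music_scene", "video": "video_scene"}
    # Unknown or missing types fall back to a generic scene (also an assumption).
    return scene_by_type.get(attrs.app_type, "default_scene")
```

Keeping the mapping in a dictionary makes it straightforward to add further scenes without changing the control flow.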

step S30, displaying the subtitle content corresponding to the target text data in the current interface according to the subtitle scene.

When the intelligent terminal recognizes that the user has turned on the subtitle function, it may display a default hover frame in the current interface; alternatively, it may display the hover frame together with the text content once the target text data has been generated by conversion. Optionally, after the user turns on the subtitle function, a subtitle icon may be displayed in the interface to indicate that subtitles are currently on. Fig. 4 is an interface diagram showing a hover frame and a subtitle. As shown in Fig. 4, when a user watches a movie, the subtitle function may be turned on; a hover frame 302 and a subtitle 303 are displayed in the current interface 301 of the intelligent terminal, and the subtitle content is updated following the audio of the movie.

In some possible embodiments, step S30 may further include:

step g, obtaining preset display information according to the subtitle scene and/or the play identification information of the audio stream data;

The subtitle scene classifies the audio stream data broadly, while the play identification information distinguishes the audio stream data more precisely. Audio stream data may take various forms, such as movies, television shows, music, and audio books, and the play identification information may be characteristic information of the audio stream data in those forms, such as a name, a transmission number, or a version number. The play identification information can be obtained while the audio stream data is being acquired, or by querying the application program associated with the audio stream data. For audio stream data having specific play identification information, the play identification information may be recorded and preset display information set accordingly. Each category of subtitle scene may also be provided with preset display information corresponding to that category.

The preset display information may include at least one of a subtitle display position, a subtitle font, a subtitle color, and a subtitle hover frame. After identifying the category of the subtitle scene and/or the play identification information of the audio stream data, the intelligent terminal can acquire the preset display information associated with them, and use it to set and display the subtitle display position, subtitle font, subtitle color, and subtitle hover frame in its interface.

The subtitle display position sets where the subtitle content is displayed in the current interface, and the subtitle hover frame setting determines the size, shape, and transparency of the hover frame. The subtitle display position, subtitle font, and subtitle color can be regarded as the appearance settings of the subtitle; the hover frame can be regarded as an interface control displayed together with the subtitle, and its appearance can also be set. For different subtitle scenes and/or play identification information, a default hover frame appearance and a default subtitle appearance can be preset; once the subtitle scene and/or play identification information is determined, the corresponding default hover frame appearance and default subtitle appearance are displayed. The hover frame in the current interface of the intelligent terminal may include an appearance-setting control, through which the user can adjust the appearance of the hover frame and the subtitle.

Displaying the hover frame and the subtitle in a specific display style according to the play identification information of the audio stream data improves how well the appearance of the hover frame and the subtitle matches the audio stream data in the current interface, giving the user a more immersive viewing experience.

step h, displaying the subtitle content corresponding to the target text data in the current interface according to the preset display information.

The default hover frame may be displayed according to the setting values in the preset display information, and the text of the target text data is filled into the hover frame according to the default subtitle appearance, which improves how well the subtitle matches the content played on the intelligent terminal and gives the user an immersive experience. While the hover frame is displayed in the interface, the user may also click, double-click, long-press, or drag the hover frame to change its size and position. Optionally, when the subtitle content is too long to be displayed completely in the hover frame, the subtitle may be displayed in a scrolling manner.
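The scrolling display mentioned above could be approximated as follows. This is only a sketch: it assumes the hover frame capacity is expressed as a character width and a maximum number of lines, and it relies on whitespace-based wrapping from the Python standard library, which suits space-separated languages only. The function name is hypothetical.

```python
import textwrap
from typing import List


def scroll_frames(text: str, width: int, max_lines: int) -> List[List[str]]:
    """Wrap text to the frame width and return the successive line windows to show."""
    lines = textwrap.wrap(text, width=width)
    if len(lines) <= max_lines:
        # Everything fits: a single window, no scrolling needed.
        return [lines]
    # Slide a window of max_lines over the wrapped lines, one line per step.
    return [lines[i:i + max_lines] for i in range(len(lines) - max_lines + 1)]
```

Each returned window is what the hover frame would show at one scroll step; how quickly the steps advance would depend on the playback timing, which is outside this sketch.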

For example, in a video scene, the hover frame is displayed near the bottom of the current interface by default and the text size is set to be moderate, so that the subtitle does not block the video picture and degrade the user's viewing experience; this display scene is shown in Fig. 4. Fig. 5 is a schematic diagram of subtitle display in a music scene. As shown in Fig. 5, the hover frame 302 may be displayed near the top of the current interface 301, and the text size of the subtitle 303 is set to be moderate, so as to avoid blocking the application icon 304. Fig. 6 is a schematic diagram of subtitle display in a call scene. As shown in Fig. 6, the hover frame 302 may be displayed in the upper-middle part of the current interface 301; the hover frame 302 is set to be larger and the text size of the subtitle 303 moderate, so that more call content is displayed in the hover frame 302 without blocking the function keys 305 in the call interface.
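The lookup of preset display information in steps g and h might be sketched as below. The positions and frame sizes follow the video, music, and call examples above; the fonts, colors, the play identifier "example-movie-v1", and the rule that play identification information overrides the scene preset are assumptions introduced purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class DisplayPreset:
    position: str    # subtitle display position
    font: str        # subtitle font (placeholder values)
    color: str       # subtitle color (placeholder values)
    frame_size: str  # hover frame size


DEFAULT_PRESET = DisplayPreset("bottom", "sans", "white", "medium")

# Per-scene presets mirroring the Fig. 4 to Fig. 6 examples.
SCENE_PRESETS: Dict[str, DisplayPreset] = {
    "video_scene": DisplayPreset("bottom", "sans", "white", "medium"),
    "music_scene": DisplayPreset("top", "sans", "white", "medium"),
    "call_scene": DisplayPreset("upper_middle", "sans", "white", "large"),
}

# Hypothetical preset recorded for one specific piece of content.
PLAY_ID_PRESETS: Dict[str, DisplayPreset] = {
    "example-movie-v1": DisplayPreset("bottom", "serif", "yellow", "medium"),
}


def get_display_preset(scene: Optional[str],
                       play_id: Optional[str] = None) -> DisplayPreset:
    """Prefer a preset tied to the play identification, then the scene, then the default."""
    if play_id in PLAY_ID_PRESETS:
        return PLAY_ID_PRESETS[play_id]
    return SCENE_PRESETS.get(scene, DEFAULT_PRESET)
```

Because the play identification information is described as the finer-grained classifier, the sketch checks it first; an implementation could equally merge the two presets field by field.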

In this embodiment, the information processing method of the present application includes the following steps: S10: acquiring or determining audio stream data, and converting the audio stream data into target text data; S20: acquiring or determining a subtitle scene associated with the audio stream data; S30: displaying the subtitle content corresponding to the target text data in the current interface according to the subtitle scene. Through this technical solution, audio can be converted offline and displayed in the current interface in the form of subtitles in a hover frame, which solves the problem that subtitles cannot be displayed in a no-network or weak-network environment, enhances the applicability of the subtitle function, and further improves the user experience. In addition, the subtitle display mode is associated with the subtitle scene, which can bring a richer viewing experience to the user.

Second embodiment

Based on the above-described first embodiment, a second embodiment of the information processing method of the present application is proposed, and as shown in fig. 7, the information processing method of the present application may further include, after the step S20 described above:

step S21, acquiring or determining a preset word stock according to the caption scene;

The execution body of this embodiment may be an intelligent terminal. The preset word stock may contain keywords for different types of audio sources, such as films, videos, songs, and other sound sources, and the keywords may be names of people, names of places, lines, and the like. The intelligent terminal can store the preset word stock, and the category of the preset word stock can be matched with the category of the subtitle scene; for example, a movie word stock is matched with a movie scene. This embodiment does not limit the method of acquiring or determining the preset word stock; for example, the preset word stock of the corresponding type may be looked up according to the type of the subtitle scene, and the keywords in that word stock obtained.

step S22, performing error correction processing on the target text data based on the preset word stock.

In current audio-to-text processing, the converted text often does not correspond exactly to the audio; that is, errors may occur in the text. Error correction processing refers to correcting erroneous text into the correct text. This embodiment does not limit the method of error correction processing; for example, homonym matching and paronym matching may be performed between the converted target text data and the keywords in the preset word stock, and words in the target text data are replaced by the matched keywords whose matching similarity exceeds a similarity threshold, forming corrected text that is displayed in the interface of the intelligent terminal.
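The threshold-based replacement described above can be sketched with a generic string similarity measure. Note the substitution: the application speaks of homonym and paronym matching, whereas this sketch uses the character-level ratio from Python's difflib.SequenceMatcher as a stand-in. The function name, the default threshold of 0.75, and the whitespace tokenization are assumptions.

```python
from difflib import SequenceMatcher
from typing import Iterable


def correct_with_lexicon(text: str,
                         keywords: Iterable[str],
                         threshold: float = 0.75) -> str:
    """Replace each word by the most similar lexicon keyword above the threshold."""
    keyword_list = list(keywords)
    corrected = []
    for word in text.split():
        best, best_score = word, threshold
        for keyword in keyword_list:
            # Character-level similarity in [0, 1]; a stand-in for homonym/paronym matching.
            score = SequenceMatcher(None, word.lower(), keyword.lower()).ratio()
            if score > best_score:
                best, best_score = keyword, score
        corrected.append(best)
    return " ".join(corrected)
```

For instance, with a word stock containing the names "Alice" and "Paris", the recognized text "alise flew to parris" would be corrected to "Alice flew to Paris", while unrelated words stay unchanged. Punctuation handling is deliberately left out of the sketch.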

In some possible embodiments, the step of performing error correction processing on the target text data based on the preset word stock may include:

step i, importing the preset word stock into a preset language processing model;

The preset language processing model may be an NLP (Natural Language Processing) model stored in the intelligent terminal; combined with the imported preset word stock, it can correct text for the scene associated with the audio stream data.

step j, calling the preset language processing model to perform error correction processing on the target text data.

In practice, the preset language processing model can perform lexical analysis, syntactic analysis, or semantic analysis on the input text and output corrected text in general cases, but its error correction effect is poor for errors specific to certain scenes. After the preset word stock is imported into the preset language processing model, once an erroneous word is detected at an error position in the target text data, candidate words can be generated by searching for homonyms or paronyms of the erroneous word. The candidate words are ranked by probability, and the candidate with the highest probability is used as the correct word to replace the erroneous word in the text, generating the corrected text data. In addition, the input of the error correction processing may be target text data corresponding to the original language or target text data corresponding to the target language.
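The candidate generation and probability ranking in step j can be illustrated with a toy model. Everything here is an assumption made for the sketch: the confusion table stands in for the homonym or paronym search, the base probabilities stand in for the language model, and the multiplicative boost for words found in the preset word stock is one possible way of letting the imported word stock influence the ranking. Detection of the erroneous word itself is out of scope.

```python
from typing import Dict, List, Set


def rank_candidates(candidates: List[str],
                    base_prob: Dict[str, float],
                    lexicon: Set[str],
                    lexicon_boost: float = 5.0) -> List[str]:
    """Sort candidates by probability, boosting those found in the scene word stock."""
    def score(word: str) -> float:
        prob = base_prob.get(word, 1e-6)  # small floor for unseen words
        return prob * lexicon_boost if word in lexicon else prob
    return sorted(candidates, key=score, reverse=True)


def correct_word(wrong: str,
                 confusion_table: Dict[str, List[str]],
                 base_prob: Dict[str, float],
                 lexicon: Set[str]) -> str:
    """Return the highest-ranked candidate for an erroneous word, or the word itself."""
    candidates = confusion_table.get(wrong, [])
    if not candidates:
        return wrong
    return rank_candidates(candidates, base_prob, lexicon)[0]
```

In this toy setup, "night" would normally outrank "knight" on general probability, but importing a movie word stock that contains "knight" flips the ranking, which is the scene-specific effect the paragraph above describes.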

In addition, the preset word stock and the preset language processing model stored in the intelligent terminal can run offline, and can be updated online when the network state is good and the current time falls within the preset update time period.

The intelligent terminal detects its network state and, when the network state is good and the preset update time period has been reached, connects to the subtitle server to update the preset language processing model and the preset word stock. The preset update time period may be set in the setting interface of the subtitle function, or may be a default value.
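The update condition can be expressed as a simple check. The sketch interprets the preset update time period as a daily time window, which is an assumption; the application does not say whether the period is a time-of-day window or an interval between updates. The function and parameter names are hypothetical, and how "good" network state is measured is left to the caller.

```python
from datetime import datetime, time


def should_update(network_good: bool,
                  now: datetime,
                  window_start: time,
                  window_end: time) -> bool:
    """Allow an online update only with a good network and inside the update window."""
    if not network_good:
        return False
    current = now.time()
    if window_start <= window_end:
        return window_start <= current <= window_end
    # The window wraps past midnight, e.g. 23:00 to 04:00.
    return current >= window_start or current <= window_end
```

A terminal could call this check periodically and connect to the subtitle server only when it returns true.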

To update the preset word stock, hot words whose popularity within a preset past time period exceeds a preset popularity value, or hot words whose popularity increase exceeds a preset popularity increase, may first be acquired and then added to the preset word stock. To update the preset language processing model, an update request may be sent to the subtitle server to obtain model update parameters, and the preset language processing model is updated according to those parameters. The subtitle server may also include a language processing model and can provide online subtitle generation support for the intelligent terminal. In addition, the intelligent terminal can connect to the subtitle server to update the offline subtitle model.
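The hot word selection for the word stock update might look like the following sketch. The popularity figures are assumed to arrive as plain word-to-score mappings for the current and the previous period; the function names and the use of an absolute difference as the "popularity increase" are illustrative assumptions.

```python
from typing import Dict, Set


def select_hot_words(current_heat: Dict[str, float],
                     previous_heat: Dict[str, float],
                     heat_threshold: float,
                     growth_threshold: float) -> Set[str]:
    """Pick words whose popularity, or popularity increase, exceeds its threshold."""
    hot = set()
    for word, heat in current_heat.items():
        # Words absent from the previous period are treated as having zero popularity.
        growth = heat - previous_heat.get(word, 0.0)
        if heat > heat_threshold or growth > growth_threshold:
            hot.add(word)
    return hot


def update_lexicon(lexicon: Set[str], hot_words: Set[str]) -> Set[str]:
    """Return a new word stock that includes the selected hot words."""
    return lexicon | hot_words
```

Returning a new set rather than mutating the stored word stock keeps the update easy to roll back if the download from the subtitle server turns out to be incomplete.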

In this embodiment, combining the preset language processing model with the preset word stock enables scene-specific error correction of the target text data, so that the subtitle content corresponding to the corrected text data conveys the information of the audio stream data to the user more accurately. The preset word stock and the preset language processing model can also be updated when the preset conditions are met, further improving the accuracy of the subtitle content.

Third embodiment

The present application also provides an information processing apparatus including:

the conversion module is used for acquiring or determining audio stream data and converting the audio stream data into target text data;

a determining module, configured to acquire or determine a subtitle scene associated with the audio stream data;

and the display module is used for displaying the caption content corresponding to the target text data in the current interface according to the caption scene.

Optionally, the conversion module is further configured to:

acquiring or determining a sound source currently associated with a subtitle function, and acquiring audio stream data of the sound source;

Optionally, the conversion module is further configured to:

and calling a translation service to convert the audio stream data into target text data corresponding to the target language.

Optionally, the determining module is further configured to:

identifying an application program associated with the audio stream data, and acquiring attribute information of the application program;

determining the subtitle scene according to the attribute information.

Optionally, the information processing apparatus further includes an error correction module configured to:

acquiring or determining a preset word stock according to the subtitle scene;

Optionally, the error correction module is further configured to:

importing the preset word stock into a preset language processing model;

and calling the preset language processing model to perform error correction processing on the target text data.

Optionally, the display module is further configured to:

the preset display information comprises at least one of caption display positions, fonts of captions, colors of the captions and caption suspension frames.

The application also provides an intelligent terminal, which comprises a memory and a processor, wherein the memory stores an information processing program, and the information processing program is executed by the processor to realize the steps of the information processing method in any embodiment.

The present application also provides a storage medium having stored thereon an information processing program which, when executed by a processor, implements the steps of the information processing method in any of the above embodiments.

The embodiments of the intelligent terminal and the storage medium provided in the present application may include all technical features of any one of the embodiments of the information processing method, and the expansion and explanation contents of the description are substantially the same as those of each embodiment of the method, which are not repeated herein.

The present embodiments also provide a computer program product comprising computer program code which, when run on a computer, causes the computer to perform the method in the various possible implementations as above.

The embodiments also provide a chip including a memory for storing a computer program and a processor for calling and running the computer program from the memory, so that a device on which the chip is mounted performs the method in the above possible embodiments.

It can be understood that the above scenario is merely an example, and does not constitute a limitation on the application scenario of the technical solution provided in the embodiments of the present application, and the technical solution of the present application may also be applied to other scenarios. For example, as one of ordinary skill in the art can know, with the evolution of the system architecture and the appearance of new service scenarios, the technical solutions provided in the embodiments of the present application are equally applicable to similar technical problems.

The foregoing embodiment numbers of the present application are merely for description and do not represent the advantages or disadvantages of the embodiments.

The steps in the method of the embodiment of the application can be sequentially adjusted, combined and deleted according to actual needs.

The units in the device of the embodiment of the application can be combined, divided and pruned according to actual needs.

In this application, descriptions of the same or similar terms, concepts, technical solutions, and/or application scenarios are generally given in detail only where they first appear and, for brevity, are generally not repeated later. When reading the technical solutions of this application, reference may be made to the earlier detailed descriptions for any such term, concept, technical solution, or application scenario that is not described in detail later.

In this application, each embodiment is described with its own emphasis; for parts not detailed in one embodiment, reference may be made to the related descriptions of other embodiments.

The technical features of the technical solutions of the present application may be arbitrarily combined, and for brevity of description, all possible combinations of the technical features in the above embodiments are not described, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the present application.

From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) as above, including several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a controlled terminal, or a network device, etc.) to perform the method of each embodiment of the present application.

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer storage medium or transmitted from one computer storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, digital subscriber line) or wireless means (e.g., infrared, wireless, microwave). A computer storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, storage disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

The foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the claims, and all equivalent structures or equivalent processes using the descriptions and drawings of the present application, or direct or indirect application in other related technical fields are included in the scope of the claims of the present application.

Claims

1. An information processing method, characterized in that the information processing method comprises the steps of:

2. The information processing method according to claim 1, wherein the S10 step includes:

3. The information processing method according to claim 1, wherein the S10 step includes:

4. The information processing method according to any one of claims 1 to 3, characterized in that the S20 step includes:

identifying or acquiring an application associated with the audio stream data;

5. The information processing method according to any one of claims 1 to 3, characterized by further comprising, after the step S20:

acquiring or determining a preset word stock according to the subtitle scene;

6. The information processing method according to any one of claims 1 to 3, characterized in that the S10 step further comprises:

7. The information processing method according to any one of claims 1 to 3, characterized in that the S30 step includes:

8. The information processing method according to claim 7, wherein the preset display information includes at least one of a subtitle display position, a font of a subtitle, a color of a subtitle, and a subtitle hover frame.

9. An intelligent terminal, characterized in that, the intelligent terminal includes: a memory, a processor, wherein the memory has stored thereon an information processing program which, when executed by the processor, implements the steps of the information processing method according to any one of claims 1 to 8.

10. A storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the information processing method according to any one of claims 1 to 8.