EP3534362A1 - Methods and apparatus for outputting audio - Google Patents

Methods and apparatus for outputting audio

Info

Publication number
EP3534362A1
Authority
EP
European Patent Office
Prior art keywords
text
electronic device
display
display device
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP18159260.1A
Other languages
German (de)
French (fr)
Inventor
Ulas YÜKSEL
Murat Dogan
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vestel Elektronik Sanayi ve Ticaret AS
Original Assignee
Vestel Elektronik Sanayi ve Ticaret AS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vestel Elektronik Sanayi ve Ticaret AS filed Critical Vestel Elektronik Sanayi ve Ticaret AS
Priority to EP18159260.1A
Publication of EP3534362A1
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Controls And Circuits For Display Device (AREA)

Abstract

In one aspect, a method of outputting audio comprises an electronic device (200) sending a request to a display device (100) for the display device (100) to transmit display text to the electronic device (200). The electronic device (200) receives display text from the display device (100), causes the received display text to be converted to corresponding audio, and sends the audio to a loudspeaker for output to a user (400).

Description

    Technical Field
  • The present disclosure relates to methods and apparatus for outputting audio.
  • Background
  • Speech synthesis is the process of generating audio which artificially mimics human speech. This can be performed using specialised hardware, but is nowadays commonly performed by software configured to generate an audio waveform for output by a speaker.
  • A text-to-speech (TTS) system takes a text input and uses speech synthesis techniques to generate corresponding audio which a human listener would interpret as substantially the same as if a human reader had read the text out loud.
  • A typical TTS system comprises a front-end and a back-end. The front-end prepares the text input for synthesis into speech by the back-end. To do so, the front-end may perform various steps such as text normalization, text-to-phoneme conversion, dividing the text into prosodic units, etc. as known in the art.
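To make the front-end concrete, here is a minimal sketch in Python, assuming three illustrative stages (text normalization, text-to-phoneme conversion, and division into prosodic units); the tiny number table and lexicon are placeholders rather than any real linguistic front-end.

```python
# Minimal sketch of a TTS front-end: normalization, text-to-phoneme conversion,
# and division into prosodic units. The number table and lexicon are toy
# placeholders, not a real front-end.
import re

NUMBER_WORDS = {"2": "two", "4": "four"}
TOY_LEXICON = {"train": ["T", "R", "EY", "N"], "delayed": ["D", "IH", "L", "EY", "D"]}

def normalize(text: str) -> str:
    """Lower-case words and expand single digits, keeping punctuation for phrasing."""
    tokens = re.findall(r"[A-Za-z]+|\d|[.,;:]", text)
    return " ".join(NUMBER_WORDS.get(t, t.lower()) for t in tokens)

def to_phonemes(word: str) -> list:
    """Look the word up in the toy lexicon, falling back to spelling it out."""
    return TOY_LEXICON.get(word, list(word.upper()))

def front_end(text: str) -> list:
    """Split normalized text on punctuation into prosodic units of phoneme lists."""
    units = [u.strip() for u in re.split(r"[.,;:]", normalize(text)) if u.strip()]
    return [[to_phonemes(w) for w in unit.split()] for unit in units]

if __name__ == "__main__":
    print(front_end("Train 2 delayed, platform 4"))
```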
  • Summary
  • According to a first aspect disclosed herein, there is provided a method of outputting audio comprising: an electronic device sending a request to a display device for the display device to transmit display text to the electronic device; the electronic device receiving display text from the display device; the electronic device causing the received display text to be converted to corresponding audio; and the electronic device sending the audio to a loudspeaker for output to a user.
  • In an example, causing the received display text to be converted to corresponding audio comprises applying a text-to-speech function to the received display text at the electronic device to generate the corresponding audio.
  • In an example, causing the received display text to be converted to corresponding audio comprises sending the received display text to an external text-to-speech function and receiving the corresponding audio from the external text-to-speech function.
  • In an example, the received display text comprises at least one of: a specific public broadcast item, a menu item, a subtitle, and Electronic Program Guide content.
  • In an example, the display text is text that is currently being displayed or about to be displayed by the display device.
  • In an example, the electronic device is a mobile device.
  • In an example, the request is for the display device to transmit multiple instances of display text to the electronic device.
  • According to another aspect disclosed herein, there is provided a method for enabling display text to be presented by an electronic device as audio, the method comprising: a display device receiving a request from an electronic device for the display device to transmit display text to the electronic device; and the display device sending the display text to the electronic device such that the electronic device can cause the received display text to be converted to corresponding audio and send the audio to a loudspeaker for output to a user.
  • In an example, the method comprises the display device sending multiple instances of display text to the electronic device.
  • In an example, each instance of display text is a display text currently displayed or about to be displayed by the display device.
  • According to another aspect disclosed herein, there is provided a computer program comprising instructions such that when the computer program is executed on an electronic device, the electronic device is arranged to: send a request to a display device for the display device to transmit display text to the electronic device; receive display text from the display device; cause the received display text to be converted to corresponding audio; and send the audio to a loudspeaker for output to a user.
  • According to another aspect disclosed herein, there is provided a computer program comprising instructions such that when the computer program is executed on a display device, the display device is arranged to: receive a request from an electronic device for the display device to transmit display text to the electronic device; and send the display text to the electronic device such that the electronic device can cause the received display text to be converted to corresponding audio and send the audio to a loudspeaker for output to a user.
  • There may be provided a non-transitory computer-readable storage medium storing a computer program as described above.
  • There may also be provided a method of outputting audio comprising: an electronic device sending a request to a display device for the display device to transmit display text to the electronic device; the display device receiving the request from the electronic device for the display device to transmit display text to the electronic device and sending the display text to the electronic device in response thereto; the electronic device receiving display text from the display device; the electronic device causing the received display text to be converted to corresponding audio; and the electronic device sending the audio to a loudspeaker for output to a user.
  • According to another aspect disclosed herein, there is provided an electronic device, the electronic device being constructed and arranged to: send a request to a display device for the display device to transmit display text to the electronic device; cause display text received from the display device to be converted to corresponding audio; and send the audio to a loudspeaker for output to a user.
  • In an example, the electronic device is configured to cause the received display text to be converted to corresponding audio by applying a text-to-speech function to the received display text at the electronic device to generate the corresponding audio.
  • According to another aspect disclosed herein, there is provided a display device, the display device being constructed and arranged to: receive a request from an electronic device for the display device to transmit display text to the electronic device; and send the display text to the electronic device such that the electronic device can cause the received display text to be converted to corresponding audio and send the audio to a loudspeaker for output to a user.
  • Brief Description of the Drawings
  • To assist understanding of the present disclosure and to show how embodiments may be put into effect, reference is made by way of example to the accompanying drawings in which:
    • Figure 1 shows schematically an example system comprising a display device and an electronic device;
    • Figure 2 shows schematically another example system comprising a display device and an electronic device; and
    • Figure 3 shows a flowchart of an example method performed by the display device and electronic device.
    Detailed Description
  • Figure 1 shows schematically a system comprising a display device 100 and an electronic device 200 in accordance with examples described herein. The display device 100 is operatively coupled to the electronic device 200 via a wireless connection such as a Bluetooth connection, WiFi connection, etc. A wired connection could also be used instead or in addition. The electronic device 200 may be operatively coupled to one or more external devices 401, 402 as described in more detail below. Data communication technologies such as Bluetooth and WiFi are well known in the art and so are not described in detail herein. The type of connection between the display device 100 and the electronic device 200 may be different from the type of connection between the electronic device 200 and the one or more external devices 401, 402 (which may themselves be different from each other).
  • The display device 100 is constructed and arranged to display text which can be seen by people in view of the display device 100. Figure 1 shows a user 400 in the vicinity of the display device 100. Suitable display technologies by which the display device 100 may display text include, but are not limited to, video displays such as an LCD (liquid crystal display), LED (light emitting diode), OLED (organic light emitting diode), etc., and non-video displays such as a split-flap display, dot-matrix display, flip-dot display, etc.
  • In this example, the display device 100 is a television set such as an LCD TV. Other examples of suitable display devices include computer monitors, cinema screens, public or private signage (which often use LED displays, in which LEDs generate the image directly, i.e. no backlight is used), dot matrix displays, etc. The display device 100 itself may or may not be capable of outputting audio, as discussed in more detail below.
  • The electronic device 200 is operated by the user 400. In this example, the electronic device 200 is a mobile device of the user 400 such as a smartphone, PDA (personal digital assistant), etc. Other electronic devices may be used, such as a laptop computer. In any case, the user 400 is able to use the electronic device 200 to listen to audio. This may comprise the electronic device 200 outputting audio via an internal loudspeaker of the electronic device 200. Alternatively or additionally, the electronic device 200 may cause the audio to be output by one or more external devices. Two examples of external devices, an external loudspeaker 401 and headphones 402, are shown in Figure 1. The electronic device 200 may be configured to transmit the audio to an external device via a wireless connection such as a Bluetooth or WiFi connection. A wired connection could also be used.
  • The user 400 may be visually-impaired, meaning that the user 400 may find the text displayed by the display device 100 difficult to read. For example, the display device 100 may be a television or cinema screen displaying a film or movie with subtitles. As another example, the display device 100 may be a sign in a public place such as a dot-matrix display or an LED display or "LED wall" at a train station displaying train time information. In both these cases, being unable to read the text easily may be a problem for the user 400 (the user 400 may not follow the film, or may miss the intended train).
  • In accordance with examples described herein, the electronic device 200 may send a request to the display device 100 for the text that the display device 100 is currently displaying or is about to display. The display device 100 may then respond by sending the text it is displaying or is about to display to the electronic device 200. In some examples, the display device 100 may be configured to stream the text it is displaying to the electronic device 200. This may be performed in response to an initial request for the text from the electronic device 200. This has the advantage that the electronic device 200 is kept up-to-date with the currently displayed text, as the actual text which is displayed by the display device 100 may change from one moment to the next.
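As a rough illustration of this exchange, the sketch below shows the electronic device asking the display device for its current text and optionally subscribing to updates. The JSON message shape and the generic connection object are assumptions added for illustration; the patent does not prescribe any particular message format or transport.

```python
# Sketch of the text request/reply exchange between the electronic device and
# the display device. The JSON fields and the connection object are illustrative
# assumptions, not a protocol defined by the patent.
import json

def request_display_text(connection, stream: bool = False) -> None:
    """Ask the display device for its current text; stream=True asks for ongoing updates."""
    message = {"type": "text_request", "scope": "all", "stream": stream}
    connection.send(json.dumps(message).encode("utf-8"))

def parse_text_reply(raw: bytes) -> str:
    """Pull the display text out of the display device's reply."""
    reply = json.loads(raw.decode("utf-8"))
    return reply.get("text", "")
```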
  • After receiving the text, the electronic device 200 then converts the text into corresponding audio, e.g. using a text-to-speech engine, and outputs the audio to the user 400. The electronic device 200 may perform the conversion from text to audio itself and/or may outsource it to an external service, e.g. running on the "cloud". This is explained in more detail below.
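A possible shape for this choice between on-device and outsourced conversion is sketched below; both backends are placeholders standing in for an embedded engine and a cloud service, since the patent leaves the conversion mechanism open.

```python
# Sketch of dispatching text-to-speech conversion either to an on-device engine
# or to an external ("cloud") service; both functions are placeholders.
def local_tts(text: str) -> bytes:
    # A real embedded engine would synthesise a waveform here.
    return b"local-audio:" + text.encode("utf-8")

def cloud_tts(text: str) -> bytes:
    # A real implementation would call a remote text-to-speech API here.
    return b"cloud-audio:" + text.encode("utf-8")

def text_to_audio(text: str, prefer_local: bool = True) -> bytes:
    """Convert text with the preferred backend, falling back to the other on failure."""
    primary, fallback = (local_tts, cloud_tts) if prefer_local else (cloud_tts, local_tts)
    try:
        return primary(text)
    except Exception:
        return fallback(text)
```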
  • The provided text may be text which is part of a normal visual item displayed by the display device (e.g. subtitles which are part of a film), or may be other textual content that can be presented by display devices such as menu items while the user is browsing the user interface, Electronic Program Guide content or any other text data available on the display device 100. In any case, this enables an auditory description of visually presented text to be provided locally to the user. This is of particular use in making visually impaired users more aware of the displayed content. A television system, for example, can thereby ensure that visually impaired users get enough auditory guidance while navigating through the user interface dialogues and menus, which may be quite small when presented on the display device 100.
  • Another advantage of this is that the display device 100 does not require any audio playback capability since audio playback is handled by the electronic device 200. This is common for some types of display device such as public displays, e.g. at hospitals, stations, airports, etc.
  • Further, in the case that the display device 100 can play audio, the display device 100 does not mix the synthesised speech (e.g. synthesised subtitles, menu items, etc.) with main audio (e.g. the actual audio of the film) as the synthesised speech is only played back locally by or under control of the electronic device 200. This means that the text-to-speech service can be provided only to visually impaired people, or those who want the text-to-speech service for other reasons, without disturbing other users.
  • Figure 2 shows schematically the display device 100, electronic device 200, and additional elements in more detail. Specifically, Figure 2 illustrates various hardware and software components with which the display device 100 and electronic device 200 may be configured.
  • In the example of Figure 2, the display device 100 comprises a processor 110, a data storage 120, an input unit 131, a network unit 132, a display unit 133, and an audio unit 134. It is understood that the components shown in Figure 2 are for illustration only and that the display device 100 and electronic device 200 may be configured with more or fewer components for performing various different functionalities. In a specific example, the display device 100 does not comprise the audio unit 134. That is, the display device 100 may not be able to output audio: it may be purely a display device. In another example, the display device 100 does not comprise the input unit 131. That is, the display device 100 may not be able to receive user input, meaning that the user 400 cannot control what text is displayed (as is the case with a public sign, e.g. a railway station departure board, or a film displayed on a cinema screen).
  • The input unit 131, when present, is configured to receive user input, e.g. from a remote control 101. The input unit 131 is controlled by an input controller 114 running on the processor 110.
  • The network unit 132 is configured to transmit and receive data, e.g. to and from the electronic device 200 as mentioned above. The network unit 132 may additionally be configured to receive data from one or more external data sources such as a content broadcaster 102 (e.g. a Digital Video Broadcast) or content streamer 103 as shown in Figure 2. The network unit 132 is controlled by a network controller 115 running on the processor 110.
  • The display unit 133 is configured to present visual content on the display device 100 via, e.g. one or more of the display technologies mentioned above. The display unit 133 is controlled by a display controller 116 running on the processor 110.
  • The audio unit 134 is, for example, a loudspeaker. When present, the audio unit 134 is configured to output audio to be heard by users within earshot of the display device 100.
  • The data storage 120 is an electronic storage. The data storage 120 may be configured to store data for use by the processor 110. In this example, the data storage 120 stores content 121 to be displayed by the display device 100 and one or more user preferences 122. The data storage 120 may also store configuration data for each of the input unit 131, network unit 132, display unit 133, and audio unit 134 described above.
  • The processor 110 may also run other modules such as content renderers 111 and a user-interface (UI) as shown in Figure 2.
  • Also shown in more detail in Figure 2 is the electronic device 200. The electronic device 200 comprises at least a network unit, audio unit, processor unit, and a data storage. The data storage is configured to maintain one or more modules that are executable on the processor. Four modules are shown in Figure 2: a network controller 201, text subscriber 202, text-to-speech (TTS) engine 203, and audio controller 204.
  • The network controller 201, along with the network unit (not shown in Figure 2), is configured to manage a data connection between the electronic device 200 and the display device 100, i.e. to at least receive data from the display device 100 and optionally transmit data to the display device 100. For example, the network controller 201 may provide a wireless connection to the display device 100 using, e.g. a Bluetooth or WiFi connection. In particular, the network controller 201 is configured to receive data from the display device 100 including text which is currently, or about to be, displayed by the display device 100. In other examples, the network controller 201 may be configured to receive data from one or more of the external data sources such as a content broadcaster 102 (e.g. a Digital Video Broadcast) or content streamer 103.
  • The text subscriber 202 is configured to discover the display device 100 and subscribe to text notifications from the display device 100. This is explained in more detail below.
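One way the text subscriber 202 could be organised is sketched below; the discovery routine, the message fields, and the send() interface are assumptions added for illustration rather than details taken from the patent.

```python
# Sketch of the text subscriber role: discover the display device, subscribe to
# its text notifications, and hand incoming text on to the TTS stage. The
# discovery routine and message fields are illustrative assumptions.
from typing import Callable

class TextSubscriber:
    def __init__(self, discover: Callable, on_text: Callable):
        self._discover = discover   # e.g. a Bluetooth or WiFi discovery routine
        self._on_text = on_text     # invoked with each piece of received display text
        self._display = None

    def connect(self) -> None:
        """Find the display device and register for display-text notifications."""
        self._display = self._discover()
        self._display.send({"type": "subscribe", "content": "display_text"})

    def handle_notification(self, message: dict) -> None:
        """Forward the text carried by a notification to the text-to-speech stage."""
        if message.get("type") == "display_text":
            self._on_text(message["text"])
```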
  • The TTS engine 203 is configured to retrieve the text and convert it to synthesised speech data via an embedded text-to-speech engine or a cloud text-to-speech service. The TTS engine 203 itself may perform a full speech synthesis process on the text (the front-end and back-end). Alternatively, the text may have been pre-processed by the display device 100 by applying front-end processes (e.g. normalization, text-to-phoneme conversion, dividing the text into prosodic units, etc.) at the display device 100 before transmission of the (pre-processed) text to the electronic device 200. In such cases, the TTS engine 203 need only perform the back-end task of generating the audio from the pre-processed text.
  • The audio controller 204 is configured to cause the synthesised speech data to be output to the user 400. As mentioned above, this may be via one or more internal speakers of the electronic device 200, an external speaker, headphones, etc. The choice of output device may be specified in the user preferences 122 stored on the display device 100 or user preferences stored on the electronic device 200.
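The routing decision might look something like the sketch below, where the output names, the sink interface, and the preference key are invented for illustration; they are not specified by the patent.

```python
# Sketch of the audio controller choosing an output from user preferences;
# the sink interface and device names are illustrative assumptions.
class PrintSink:
    """Stand-in for a loudspeaker, headphone, or internal speaker driver."""
    def __init__(self, name: str):
        self.name = name
    def play(self, audio: bytes) -> None:
        print(f"{self.name}: playing {len(audio)} bytes of audio")

def route_audio(audio: bytes, preferences: dict, outputs: dict) -> None:
    """Play audio on the preferred output, falling back to the internal speaker."""
    target = preferences.get("audio_output", "internal_speaker")
    sink = outputs.get(target, outputs["internal_speaker"])
    sink.play(audio)

outputs = {"internal_speaker": PrintSink("internal speaker"),
           "headphones": PrintSink("headphones")}
route_audio(b"\x00" * 2048, {"audio_output": "headphones"}, outputs)
```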
  • Figure 3 shows a flow chart illustrating an example method performed by the system described above. Steps relating to the display device 100 displaying text and/or images are not shown as these are well known in the art. For example, the display device 100 may retrieve text or image content 121 from, for example, internal data storage 120 or external data storage such as a file server, a DVD (Digital Versatile Disc) player, a Blu-ray player, or the like, and display the content 121 on a screen. As another example, the display device 100 may receive text or image content from a remote external source (such as a content broadcaster 102 or content streamer 103), e.g. via a terrestrial, satellite or cable transmission, or over the Internet, and display the content 121 on a screen.
  • Also not shown in Figure 3 are steps relating to establishing a data connection between the electronic device 200 and the display device 100, as these are also well known in the art.
  • At S300 in Figure 3, the electronic device 200 requests text from the display device 100. This may be performed in response to specific input from the user 400. For example, the user 400 may determine that he or she is unable to read the text displayed on the display device 100 and so provides user input to the electronic device (e.g. via a graphical user interface, keyboard, etc.) causing the electronic device 200 to send the request. Alternatively or additionally, this may be performed without explicit input from the user 400. For example, the electronic device 200 may automatically request text from the display device 100 in response to connecting to the display device 100.
  • The request may be for specific text or may be more general (e.g. for "all" text currently displayed). In a specific example, the display device 100 is a train departure board at a station. In such an example, the electronic device 200 may request departure and/or journey details of a specific train, or may request the entirety of the text displayed on the display device 100. Hence, it is understood that the text provided to the electronic device 200 (see S301 below) may or may not correspond exactly to the text displayed on the display device 100. In some examples, the electronic device 200 may also be configured to act as a remote control (like the remote control 101 shown in Figure 2). When used as a remote control, the electronic device 200 may be used by the user 400 to navigate a menu on the display device 100 (e.g. by moving a cursor or highlighted text, selecting menu items, etc.). In these cases, the request for the text may comprise a request for currently selected or highlighted text, etc.
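To illustrate the departure-board case, the sketch below filters the full board text down to one service; the board layout and the matching rule are invented purely for the example.

```python
# Illustration of requesting a specific item versus the whole board: filter the
# full departure-board text down to one service. The board layout is invented.
def select_departure(board_text: str, service: str) -> str:
    """Return the board line mentioning the requested service, or the whole board if absent."""
    for line in board_text.splitlines():
        if service in line:
            return line
    return board_text  # fall back to everything currently displayed

board = ("18:05  London Paddington  Platform 4  On time\n"
         "18:12  Reading            Platform 2  Delayed")
print(select_departure(board, "18:12"))
```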
  • As another example, the display device 100, especially for example a public display device ("signage"), may be arranged such that certain text or certain classes of text are always sent to electronic devices 200 that have connected to or subscribed to the display device 100 for receiving such text. In particular, certain text or certain classes of text may be flagged at or by the display device 100 such that the text is always sent to connected electronic devices 200. This flagging may be carried out effectively by some operator of the display device 100 when instructions to display the text are given to the display device 100. For example, there may be specific public broadcast items that are presented as text on the display device 100 which are of general importance. As specific examples, there may be public broadcast items that indicate travel delays, forecasts of dangerous weather conditions, etc. As particular examples to illustrate this, the flagged text might be of the type "signalling problems are causing delays to all trains"; "all flights are delayed because of weather problems", etc. The display device 100 may be arranged such that text that has been flagged in this way is always sent to electronic devices 200 that have connected to or subscribed to the display device 100 for receiving such flagged text.
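A display-side sketch of this flagging behaviour follows; the subscriber list, the send() call, and the message fields are assumptions added for illustration, and the actual rendering of the text is omitted.

```python
# Sketch of flagged text being pushed to every subscribed electronic device.
# The subscriber list, send() call, and message fields are illustrative assumptions.
class PublicDisplay:
    def __init__(self):
        self.subscribers = []   # electronic devices that have subscribed for flagged text

    def show(self, text: str, flagged: bool = False) -> None:
        """Display the text (rendering omitted) and push it to subscribers if flagged."""
        if flagged:
            for device in self.subscribers:
                device.send({"type": "display_text", "text": text, "flagged": True})

# e.g. display.show("Signalling problems are causing delays to all trains", flagged=True)
```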
  • In S301, the display device 100 provides corresponding text in response to receiving the request or at least following receipt of a request. The display device 100 may have access to the text in the form of a text file, e.g. a standalone subtitles file of the content 121. Alternatively, the display device 100 may extract text from displayed images by applying known image-recognition techniques. For example, if the content being displayed by the display device 100 is a film received from the content broadcaster 102 having subtitles that are embedded as images, the display device 100 can extract these by analysing the images of the film itself.
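The two text sources described above might be combined as in the sketch below; the ocr() helper is only a placeholder for whatever image-recognition technique the display device actually uses, and the file-handling details are assumptions.

```python
# Sketch of choosing between a standalone subtitles file and extracting text
# from the displayed images; ocr() is a placeholder for an image-recognition step.
from pathlib import Path
from typing import Optional

def ocr(frame: bytes) -> str:
    return ""   # placeholder: a real implementation would run image recognition here

def get_display_text(subtitle_file: Optional[Path], current_frame: bytes) -> str:
    """Prefer a subtitle text file when one exists; otherwise extract text from the frame."""
    if subtitle_file is not None and subtitle_file.exists():
        return subtitle_file.read_text(encoding="utf-8")
    return ocr(current_frame)
```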
  • In any case, the display device 100 may configure the visually presented message in a text format, such as through use of ASCII characters, and send the text to the electronic device 200 via a data connection such as a Bluetooth connection. As mentioned above, in some examples the display device 100 first pre-processes the text by applying one or more front-end processes such as text normalization, text-to-phoneme conversion, dividing the text into prosodic units, etc.
  • The request may be a one-time request for text currently (or about to be) displayed by the display device 100. Alternatively, the request may be a request to initiate multiple instances of text being sent to the electronic device 200. That is, the display device 100 may provide, in response to a single (initial) request, any currently displayed text to the electronic device 200 on a dynamic basis in substantially real-time. To do so, the display device 100 may provide updated text to the electronic device 200 as and when the text it is displaying (see S302 below) updates. Alternatively, the display device 100 may provide currently displayed text to the electronic device on a regular basis, e.g. once a second.
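Both update strategies, sending on change and sending periodically, could be expressed as in the sketch below (bounded to a few iterations here); the connection object, the text-source callable, and the timing are assumptions for illustration.

```python
# Sketch of the display device streaming text after an initial request, either
# whenever the text changes or at a fixed period (e.g. once a second). The
# connection object and text source callable are placeholders.
import time

def stream_text(connection, current_text, on_change_only: bool = True,
                period_s: float = 1.0, iterations: int = 5) -> None:
    """Send the current display text repeatedly; iterations bounds this sketch."""
    last = None
    for _ in range(iterations):
        text = current_text()
        if not on_change_only or text != last:
            connection.send(text.encode("utf-8"))
            last = text
        time.sleep(period_s)
```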
  • At S302, the display device 100 displays the text. Note that this may be performed by the display device 100 before it has received the request from the electronic device 200 and/or before it has provided the text to the electronic device 200 (S300 and S301, respectively). In other words, the display device 100 may have already been displaying the text (S302) when the request was received. In other examples, as mentioned above, the display device 100 receives an initial request and begins continuously sending text to the electronic device 200 substantially in real-time. Depending, for example, on user preferences, the text streamer module 113 on the display device 100 may choose the text to be streamed from one or more text content sources, such as GUI texts, DVB Subtitles, closed or open captions, EPG tables or any customised textual description of the visually presented contents and GUI elements.
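Source selection might be as simple as the sketch below; the registry of sources and the preference key are invented for illustration, with the source names taken from the list above.

```python
# Sketch of the text streamer choosing which text sources to forward based on
# user preferences. The source registry and preference keys are illustrative.
def choose_text_sources(preferences: dict, available: dict) -> list:
    """Return the callables for the sources the user wants streamed."""
    wanted = preferences.get("text_sources", ["subtitles", "gui_text"])
    return [available[name] for name in wanted if name in available]

available_sources = {
    "gui_text": lambda: "Settings > Sound > Subtitle voice-over",
    "subtitles": lambda: "The 18:12 to Reading is delayed.",
    "epg": lambda: "20:00 Evening News",
}
for source in choose_text_sources({"text_sources": ["subtitles", "epg"]}, available_sources):
    print(source())
```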
  • At S303, the electronic device 200 performs a text-to-speech (TTS) process on the received text to generate corresponding audio from the text. If the display device 100 has not pre-processed the text, as mentioned above, then the electronic device 200 may perform the pre-processing.
  • At S304, the electronic device 200 sends the generated audio to a speaker for outputting to the user 400. In the example of Figure 3, the electronic device 200 sends the audio to headphones 402. However, it is appreciated that the audio could be sent to an external speaker 401 or to an internal speaker of the electronic device 200 itself.
  • In any case, the external device (the headphones 402 in the example of Figure 2) then outputs the audio at S305. Hence, the user 400 acquires the semantic content of the text displayed on the display device 100 by listening to the audio.
  • It will be understood that the processor or processing system or circuitry referred to herein may in practice be provided by a single chip or integrated circuit or plural chips or integrated circuits, optionally provided as a chipset, an application-specific integrated circuit (ASIC), field-programmable gate array (FPGA), digital signal processor (DSP), graphics processing units (GPUs), etc. The chip or chips may comprise circuitry (as well as possibly firmware) for embodying at least one or more of a data processor or processors, a digital signal processor or processors, baseband circuitry and radio frequency circuitry, which are configurable so as to operate in accordance with the exemplary embodiments. In this regard, the exemplary embodiments may be implemented at least in part by computer software stored in (non-transitory) memory and executable by the processor, or by hardware, or by a combination of tangibly stored software and hardware (and tangibly stored firmware).
  • Reference is made herein to data storage for storing data. This may be provided by a single device or by plural devices. Suitable devices include for example a hard disk and non-volatile semiconductor memory.
  • Although at least some aspects of the embodiments described herein with reference to the drawings comprise computer processes performed in processing systems or processors, the invention also extends to computer programs, particularly computer programs on or in a carrier, adapted for putting the invention into practice. The program may be in the form of non-transitory source code, object code, a code intermediate source and object code such as in partially compiled form, or in any other non-transitory form suitable for use in the implementation of processes according to the invention. The carrier may be any entity or device capable of carrying the program. For example, the carrier may comprise a storage medium, such as a solid-state drive (SSD) or other semiconductor-based RAM; a ROM, for example a CD ROM or a semiconductor ROM; a magnetic recording medium, for example a floppy disk or hard disk; optical memory devices in general; etc.
  • The examples described herein are to be understood as illustrative examples of embodiments of the invention. Further embodiments and examples are envisaged. Any feature described in relation to any one example or embodiment may be used alone or in combination with other features. In addition, any feature described in relation to any one example or embodiment may also be used in combination with one or more features of any other of the examples or embodiments, or any combination of any other of the examples or embodiments. Furthermore, equivalents and modifications not described herein may also be employed within the scope of the invention, which is defined in the claims.

Claims (15)

  1. A method of outputting audio comprising:
    an electronic device sending a request to a display device for the display device to transmit display text to the electronic device;
    the electronic device receiving display text from the display device;
    the electronic device causing the received display text to be converted to corresponding audio; and
    the electronic device sending the audio to a loudspeaker for output to a user.
  2. A method according to claim 1, wherein causing the received display text to be converted to corresponding audio comprises applying a text-to-speech function to the received display text at the electronic device to generate the corresponding audio.
  3. A method according to claim 1 or claim 2, wherein causing the received display text to be converted to corresponding audio comprises sending the received display text to an external text-to-speech function and receiving the corresponding audio from the external text-to-speech function.
  4. A method according to any of claims 1 to 3, wherein the received display text comprises at least one of: a specific public broadcast item, a menu item, a subtitle, and Electronic Program Guide content.
  5. A method according to any of claims 1 to 4, wherein the display text is text that is currently being displayed or about to be displayed by the display device.
  6. A method according to any of claims 1 to 5, wherein the electronic device is a mobile device.
  7. A method according to any of claims 1 to 6, wherein the request is for the display device to transmit multiple instances of display text to the electronic device.
  8. A method for enabling display text to be presented by an electronic device as audio, the method comprising:
    a display device receiving a request from an electronic device for the display device to transmit display text to the electronic device; and
    the display device sending the display text to the electronic device such that the electronic device can cause the received display text to be converted to corresponding audio and send the audio to a loudspeaker for output to a user.
  9. A method according to claim 8, comprising the display device sending multiple instances of display text to the electronic device.
  10. A method according to claim 9, wherein each instance of display text is a display text currently displayed or about to be displayed by the display device.
  11. A computer program comprising instructions such that when the computer program is executed on an electronic device, the electronic device is arranged to:
    send a request to a display device for the display device to transmit display text to the electronic device;
    receive display text from the display device;
    cause the received display text to be converted to corresponding audio; and
    send the audio to a loudspeaker for output to a user.
  12. A computer program comprising instructions such that when the computer program is executed on a display device, the display device is arranged to:
    receive a request from an electronic device for the display device to transmit display text to the electronic device; and
    send the display text to the electronic device such that the electronic device can cause the received display text to be converted to corresponding audio and send the audio to a loudspeaker for output to a user.
  13. An electronic device, the electronic device being constructed and arranged to:
    send a request to a display device for the display device to transmit display text to the electronic device;
    cause display text received from the display device to be converted to corresponding audio; and
    send the audio to a loudspeaker for output to a user.
  14. An electronic device according to claim 13, configured to cause the received display text to be converted to corresponding audio by applying a text-to-speech function to the received display text at the electronic device to generate the corresponding audio.
  15. A display device, the display device being constructed and arranged to:
    receive a request from an electronic device for the display device to transmit display text to the electronic device; and
    send the display text to the electronic device such that the electronic device can cause the received display text to be converted to corresponding audio and send the audio to a loudspeaker for output to a user.
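
For completeness, a complementary sketch of the display-device side (as recited in claims 8 to 10 and 15) is given below, under the same caveats: the port, the GET_DISPLAY_TEXT request string and the get_current_display_text() helper are invented for illustration and are not part of the claims.

    # Hypothetical sketch of the display-device side: wait for a request from the
    # electronic device and reply with the text currently being displayed.
    import socketserver


    def get_current_display_text() -> str:
        # Placeholder: a real display device would return the menu item, subtitle
        # or Electronic Program Guide entry currently shown (or about to be shown).
        return "Now playing: Evening News - starts at 20:00"


    class DisplayTextHandler(socketserver.StreamRequestHandler):
        def handle(self) -> None:
            request = self.rfile.readline().strip()
            if request == b"GET_DISPLAY_TEXT":  # assumed request format, matching the sketch above
                self.wfile.write(get_current_display_text().encode("utf-8"))
            # the connection is closed when handle() returns, signalling end of text to the requester


    if __name__ == "__main__":
        with socketserver.TCPServer(("0.0.0.0", 5000), DisplayTextHandler) as server:
            server.serve_forever()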
EP18159260.1A 2018-02-28 2018-02-28 Methods and apparatus for outputting audio Withdrawn EP3534362A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP18159260.1A EP3534362A1 (en) 2018-02-28 2018-02-28 Methods and apparatus for outputting audio

Publications (1)

Publication Number Publication Date
EP3534362A1 (en) 2019-09-04

Family

ID=61557058

Family Applications (1)

Application Number Title Priority Date Filing Date
EP18159260.1A Withdrawn EP3534362A1 (en) 2018-02-28 2018-02-28 Methods and apparatus for outputting audio

Country Status (1)

Country Link
EP (1) EP3534362A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150358782A1 (en) * 2011-06-01 2015-12-10 Sony Mobile Communications Ab Catch the screen
US20150293745A1 (en) * 2012-11-27 2015-10-15 Denso Corporation Text-reading device and text-reading method
US20150248887A1 (en) * 2014-02-28 2015-09-03 Comcast Cable Communications, Llc Voice Enabled Screen reader

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114566166A (en) * 2022-02-23 2022-05-31 成都智元汇信息技术股份有限公司 Public transport text processing method, device and system, electronic equipment and medium

Similar Documents

Publication Publication Date Title
EP2342890B1 (en) Method and apparatus for scrolling text display of voice call or message during video display session
US10586536B2 (en) Display device and operating method therefor
US8831948B2 (en) System and method for synthetically generated speech describing media content
US9348554B2 (en) Managing playback of supplemental information
KR20170033429A (en) Terminal device, information provision system, information presentation method, and information provision method
JP6945130B2 (en) Voice presentation method, voice presentation program, voice presentation system and terminal device
US20180181366A1 (en) Modification of distracting sounds
WO2014154097A1 (en) Automatic page content reading-aloud method and device thereof
US20240184519A1 (en) Display control device for selecting item on basis of speech
US11197048B2 (en) Transmission device, transmission method, reception device, and reception method
CN110379406B (en) Voice comment conversion method, system, medium and electronic device
US20190129683A1 (en) Audio app user interface for playing an audio file of a book that has associated images capable of rendering at appropriate timings in the audio file
EP3534362A1 (en) Methods and apparatus for outputting audio
CN111818279A (en) Subtitle generating method, display method and interaction method
US11758204B2 (en) Electronic device and control method therefor
WO2024037480A1 (en) Interaction method and apparatus, electronic device, and storage medium
US20030028379A1 (en) System for converting electronic content to a transmittable signal and transmitting the resulting signal
KR100798556B1 (en) Digital apparatus comprising active display linking function
JP2018005071A (en) Terminal device
US20220264193A1 (en) Program production apparatus, program production method, and recording medium
CN110366002B (en) Video file synthesis method, system, medium and electronic device
CN113409782A (en) Method, device and system for noninductive scheduling of BI (business intelligence) large screen
US11438397B2 (en) Broadcast system, terminal apparatus, method for operating terminal apparatus, and recording medium
KR101703252B1 (en) A System for Providing Simultaneous Interpretation Service Based on FM and Digital Radio
US9712471B2 (en) Mail sending/receiving apparatus, method, and recording medium recording program

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN PUBLISHED

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the european patent

Extension state: BA ME

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20200305