CN113658598B - Voice interaction method of display equipment and display equipment - Google Patents

Voice interaction method of display equipment and display equipment

Info

Publication number
CN113658598B
CN113658598B (application number CN202110922915.8A)
Authority
CN
China
Prior art keywords
text
keywords
voice
word segmentation
keyword
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110922915.8A
Other languages
Chinese (zh)
Other versions
CN113658598A (en)
Inventor
冯建斌
吴圣春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vidaa Netherlands International Holdings BV
Original Assignee
Vidaa Netherlands International Holdings BV
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vidaa Netherlands International Holdings BV filed Critical Vidaa Netherlands International Holdings BV
Priority to CN202110922915.8A priority Critical patent/CN113658598B/en
Publication of CN113658598A publication Critical patent/CN113658598A/en
Application granted granted Critical
Publication of CN113658598B publication Critical patent/CN113658598B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The embodiment provides a voice interaction method for a display device, and a display device. A display of the display device displays a user interface and, in the user interface, web page elements used to distinguish different contents, where the web page elements contain keywords that map the web page elements. The display device further includes a sound collector, and after receiving a voice signal input by the user, the controller recognizes a voice text from the voice signal. If a keyword matching the voice text exists in the user interface, a click operation is performed on the web page element mapped by that keyword. If no keyword matching the voice text exists in the user interface, no click operation is performed on any web page element. Because the keyword mapped to each web page element is displayed on the element, the user can quickly and accurately express the web page element to interact with, which improves the interaction experience between the user and the display device.

Description

Voice interaction method of display equipment and display equipment
Technical Field
The present application relates to the technical field of display devices, and in particular to a voice interaction method for a display device and a display device.
Background
With the development of smart television products, users can browse web pages in the browser embedded in a smart television. However, because of the limitations of the device, browsing a web page on a smart television requires using the remote controller to simulate mouse movement, which makes the interaction quite cumbersome. The development of voice interaction technology now makes it possible to control the browser on a smart television by voice.
However, web page elements often have very similar content, making it difficult for the user to distinguish them. As a result, the user cannot quickly and accurately express which web page element to interact with, and the interaction experience between the user and the display device is poor.
Disclosure of Invention
The present application provides a voice interaction method for a display device and a display device, which are used to solve the problem that web page elements often have very similar content and are hard for the user to distinguish, so that the user cannot quickly and accurately express the web page element to interact with and the viewing experience is poor.
In a first aspect, the present embodiment provides a display device, including,
a display for displaying a user interface and web page elements in the user interface for distinguishing different contents, wherein the web page elements contain keywords for mapping the web page elements;
the sound collector is used for collecting voice signals input by a user;
a controller for performing:
receiving a voice signal input by a user, and recognizing a voice text from the voice signal;
Executing clicking operation on the webpage element mapped by the keyword matched with the voice text when the keyword matched with the voice text exists in the user interface;
and when the keyword matched with the voice text does not exist in the user interface, not executing clicking operation on the webpage element.
In a second aspect, the present embodiment provides a voice interaction method for a display device. The method is applied to a controller of the display device, where a display of the display device is used to display a user interface and, in the user interface, web page elements used to distinguish different contents, the web page elements containing keywords that map the web page elements. The method includes:
receiving a voice signal input by a user, and recognizing a voice text from the voice signal;
Executing clicking operation on the webpage element mapped by the keyword matched with the voice text when the keyword matched with the voice text exists in the user interface;
and when the keyword matched with the voice text does not exist in the user interface, not executing clicking operation on the webpage element.
The embodiment provides a voice interaction method for a display device, and a display device. A display of the display device displays a user interface and, in the user interface, web page elements used to distinguish different contents, where the web page elements contain keywords that map the web page elements. The display device further includes a sound collector, and after receiving a voice signal input by the user, the controller recognizes a voice text from the voice signal. If a keyword matching the voice text exists in the user interface, a click operation is performed on the web page element mapped by that keyword. If no keyword matching the voice text exists in the user interface, no click operation is performed on any web page element. Because the keyword mapped to each web page element is displayed on the element, the user can quickly and accurately express the web page element to interact with, which improves the interaction experience between the user and the display device.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 illustrates a usage scenario of a display device according to some embodiments;
fig. 2 shows a hardware configuration block diagram of the control apparatus 100 according to some embodiments;
fig. 3 illustrates a hardware configuration block diagram of a display device 200 according to some embodiments;
FIG. 4 illustrates a software configuration diagram in a display device 200 according to some embodiments;
FIG. 5 illustrates a schematic diagram of a voice interaction principle, according to some embodiments;
FIG. 6 illustrates a user interface schematic diagram in a display device 200 according to some embodiments;
fig. 7 illustrates a flow diagram of a method of voice interaction for a display device in accordance with some embodiments.
Detailed Description
For purposes of clarity and implementation of the present application, exemplary implementations of the present application are described below with reference to the accompanying drawings, in which those exemplary implementations are illustrated. It is apparent that the described exemplary implementations are only some, but not all, of the examples of the present application.
It should be noted that the brief description of the terms in the present application is only for convenience in understanding the embodiments described below, and is not intended to limit the embodiments of the present application. Unless otherwise indicated, these terms should be construed in their ordinary and customary meaning.
The terms "first," "second," "third," and the like in the description, in the claims, and in the above drawings are used to distinguish between similar objects or entities and not necessarily to describe a particular sequential or chronological order, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances.
The terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements is not necessarily limited to all elements explicitly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
The term "module" refers to any known or later developed hardware, software, firmware, artificial intelligence, fuzzy logic, or combination of hardware and/or software code that is capable of performing the function associated with that element.
Fig. 1 is a schematic diagram of a usage scenario of a display device according to an embodiment. As shown in fig. 1, the display device 200 is also in data communication with a server 400, and a user can operate the display device 200 through the smart device 300 or the control apparatus 100.
In some embodiments, the control apparatus 100 may be a remote controller. Communication between the remote controller and the display device includes at least one of infrared protocol communication, Bluetooth protocol communication, or other short-range communication modes, and the display device 200 is controlled wirelessly or by wire. The user may control the display apparatus 200 by inputting user instructions through keys on the remote controller, voice input, control panel input, or the like.
In some embodiments, the smart device 300 may include any of a mobile terminal 300A, a tablet, a computer, a notebook, an AR/VR device, etc.
In some embodiments, the smart device 300 may also be used to control the display device 200. For example, the display device 200 is controlled using an application running on a smart device.
In some embodiments, the smart device 300 and the display device may also be used for communication of data.
In some embodiments, the display device 200 may also be controlled in ways other than through the control apparatus 100 and the smart device 300. For example, the user's voice commands may be received directly through a module for acquiring voice commands configured inside the display device 200, or through a voice control apparatus configured outside the display device 200.
In some embodiments, the display device 200 is also in data communication with a server 400. The display device 200 may communicate via a local area network (LAN), a wireless local area network (WLAN), or other networks. The server 400 may provide various contents and interactions to the display device 200. The server 400 may be one cluster or multiple clusters, and may include one or more types of servers.
In some embodiments, software steps performed by one step execution body may migrate on demand to be performed on another step execution body in data communication therewith. For example, software steps executed by the server may migrate to be executed on demand on a display device in data communication therewith, and vice versa.
Fig. 2 exemplarily shows a block diagram of a configuration of the control apparatus 100 in accordance with an exemplary embodiment. As shown in fig. 2, the control apparatus 100 includes a controller 110, a communication interface 130, a user input/output interface 140, a memory, and a power supply. The control apparatus 100 may receive an operation instruction input by the user and convert the operation instruction into an instruction that the display device 200 can recognize and respond to, serving as an intermediary for interaction between the user and the display device 200.
Fig. 3 shows a hardware configuration block diagram of the display device 200 in accordance with an exemplary embodiment.
In some embodiments, display apparatus 200 includes at least one of a modem 210, a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, memory, a power supply, a user interface.
In some embodiments, the controller comprises a central processor, a video processor, an audio processor, a graphics processor, RAM, ROM, and first to nth interfaces for input/output.
In some embodiments, the display 260 includes a display screen component for presenting pictures and a driving component for driving image display, and is used to receive image signals output by the controller and to display video content, image content, menu manipulation interfaces, user manipulation UI interfaces, and the like.
In some embodiments, the display 260 may be at least one of a liquid crystal display, an OLED display, and a projection display, and may also be a projection device and a projection screen.
In some embodiments, the modem 210 receives broadcast television signals by wired or wireless reception and demodulates audio-video signals, such as EPG data signals, from among a plurality of wireless or wired broadcast television signals.
In some embodiments, communicator 220 is a component for communicating with external devices or servers according to various communication protocol types. For example: the communicator may include at least one of a Wifi module, a bluetooth module, a wired ethernet module, or other network communication protocol chip or a near field communication protocol chip, and an infrared receiver. The display apparatus 200 may establish transmission and reception of control signals and data signals with the control device 100 or the server 400 through the communicator 220.
In some embodiments, the detector 230 is used to collect signals of the external environment or interaction with the outside. For example, detector 230 includes a light receiver, a sensor for capturing the intensity of ambient light; alternatively, the detector 230 includes an image collector such as a camera, which may be used to collect external environmental scenes, user attributes, or user interaction gestures, or alternatively, the detector 230 includes a sound collector such as a microphone, or the like, which is used to receive external sounds.
In some embodiments, the external device interface 240 may include, but is not limited to, the following: high Definition Multimedia Interface (HDMI), analog or data high definition component input interface (component), composite video input interface (CVBS), USB input interface (USB), RGB port, or the like. The input/output interface may be a composite input/output interface formed by a plurality of interfaces.
In some embodiments, the controller 250 and the modem 210 may be located in separate devices, i.e., the modem 210 may also be located in an external device to the main device in which the controller 250 is located, such as an external set-top box or the like.
In some embodiments, the controller 250 controls the operation of the display device and responds to user operations through various software control programs stored on the memory. The controller 250 controls the overall operation of the display apparatus 200. For example: in response to receiving a user command to select a UI object to be displayed on the display 260, the controller 250 may perform an operation related to the object selected by the user command.
In some embodiments, the object may be any one of selectable objects, such as a hyperlink, an icon, or other operable control. The operations related to the selected object are: displaying an operation of connecting to a hyperlink page, a document, an image, or the like, or executing an operation of a program corresponding to the icon.
In some embodiments, the controller includes at least one of a central processing unit (Central Processing Unit, CPU), a video processor, an audio processor, a graphics processor (Graphics Processing Unit, GPU), RAM (Random Access Memory), ROM (Read-Only Memory), first to nth interfaces for input/output, a communication bus (Bus), and the like.
The CPU processor is used to execute operating system and application program instructions stored in the memory and to run various applications, data, and contents according to interactive instructions received from the outside, so as to finally display and play various audio and video contents. The CPU processor may include a plurality of processors, for example one main processor and one or more sub-processors.
In some embodiments, a graphics processor is used to generate various graphical objects, such as: at least one of icons, operation menus, and user input instruction display graphics. The graphic processor comprises an arithmetic unit, which is used for receiving various interactive instructions input by a user to operate and displaying various objects according to display attributes; the device also comprises a renderer for rendering various objects obtained based on the arithmetic unit, wherein the rendered objects are used for being displayed on a display.
In some embodiments, the video processor is configured to receive an external video signal and perform at least one of decompression, decoding, scaling, noise reduction, frame rate conversion, resolution conversion, image composition, and the like according to the standard codec protocol of the input signal, so as to obtain a signal that can be displayed or played directly on the display device 200.
In some embodiments, a user may input a user command through a Graphical User Interface (GUI) displayed on the display 260, and the user input interface receives the user input command through the Graphical User Interface (GUI). Alternatively, the user may input the user command by inputting a specific sound or gesture, and the user input interface recognizes the sound or gesture through the sensor to receive the user input command.
In some embodiments, a "user interface" is a media interface for interaction and exchange of information between an application or operating system and a user that enables conversion between an internal form of information and a form acceptable to the user. A commonly used presentation form of the user interface is a graphical user interface (Graphic User Interface, GUI), which refers to a user interface related to computer operations that is displayed in a graphical manner. It may be an interface element such as an icon, a window, a control, etc. displayed in a display screen of the electronic device, where the control may include a visual interface element such as an icon, a button, a menu, a tab, a text box, a dialog box, a status bar, a navigation bar, a Widget, etc.
As shown in fig. 4, a system of the display device may include a kernel, a command parser (shell), a file system, and application programs. The kernel, shell, and file system together form the basic operating system architecture that allows users to manage files, run programs, and use the system. After power-up, the kernel is started, the kernel space is activated, hardware is abstracted, hardware parameters are initialized, and virtual memory, the scheduler, signals, and inter-process communication (IPC) are operated and maintained. After the kernel is started, the shell and user application programs are loaded. An application program is compiled into machine code after being started, forming a process.
As shown in fig. 4, the system of the display device is divided into three layers, from top to bottom: an application layer, a middleware layer, and a hardware layer. The application layer mainly comprises the common applications on the television and an application framework (Application Framework); the common applications are mainly applications developed based on the browser, such as HTML5 apps, and native applications (Native APPs).
The application framework (Application Framework) is a complete program model with all the basic functions required by standard application software, such as file access and data exchange, and the interfaces for using these functions (toolbars, status bars, menus, dialog boxes).
Native applications (Native APPs) may support online or offline, message pushing, or local resource access.
The middleware layer includes middleware such as various television protocols, multimedia protocols, and system components. The middleware can use basic services (functions) provided by the system software to connect various parts of the application system or different applications on the network, so that the purposes of resource sharing and function sharing can be achieved.
The hardware layer mainly comprises the HAL interface, hardware, and drivers. The HAL interface is a unified interface against which all television chips are adapted, and the specific logic is implemented by each chip. The drivers mainly include: audio driver, display driver, Bluetooth driver, camera driver, Wi-Fi driver, USB driver, HDMI driver, sensor drivers (e.g., fingerprint sensor, temperature sensor, pressure sensor), power supply driver, and the like.
In order to clearly illustrate the embodiments of the present application, a voice recognition network architecture provided in the embodiments of the present application is described below with reference to fig. 5.
Referring to fig. 5, fig. 5 is a schematic diagram of a voice recognition network architecture according to an embodiment of the present application. In fig. 5, the smart device is configured to receive input information and output a processing result of the information. The voice recognition service device is an electronic device on which a voice recognition service is deployed, the semantic service device is an electronic device on which a semantic service is deployed, and the business service device is an electronic device on which a business service is deployed. An electronic device here may include a server, a computer, or the like. The voice recognition service for recognizing audio as text, the semantic service (which may also be referred to as a semantic engine) for semantically parsing text, and the business service for providing a specific service, such as the weather query service of Moji Weather or the music query service of QQ Music, are web services that may be deployed on these electronic devices. In one embodiment, there may be multiple entity service devices deployed with different business services in the architecture shown in fig. 5, and one or more functional services may also be aggregated on one or more entity service devices.
In some embodiments, the process of handling information input to the smart device based on the architecture shown in fig. 5 is described below, taking as an example a query sentence input by voice. The process may include the following three stages:
[ Speech recognition ]
After receiving the query sentence input by voice, the smart device may upload the audio of the query sentence to the voice recognition service device, so that the voice recognition service device recognizes the audio as text through the voice recognition service and returns the text to the smart device. In one embodiment, the smart device may denoise the audio of the query sentence before uploading it to the voice recognition service device, where denoising may include steps such as removing echoes and ambient noise.
[ Semantic understanding ]
The smart device uploads the text of the query sentence recognized by the voice recognition service to the semantic service device, so that the semantic service device performs semantic parsing on the text through the semantic service to obtain the business field, intent, and the like of the text.
[ Semantic response ]
The semantic service device issues a query instruction to the corresponding business service device according to the semantic parsing result of the text of the query sentence, so as to obtain the query result given by the business service. The smart device may obtain the query result from the semantic service device and output it. As an embodiment, the semantic service device may further send the semantic parsing result of the query sentence to the smart device, so that the smart device outputs the feedback sentence contained in the semantic parsing result.
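The three stages above form a simple request pipeline. The sketch below illustrates that flow in TypeScript; the endpoint URLs, payload shapes, and field names are illustrative assumptions and are not part of this application.

```typescript
// A minimal sketch of the three-stage pipeline described above.
interface SemanticResult {
  domain: string;      // business field of the query, e.g. "weather"
  intent: string;      // parsed intent, e.g. "query_current"
  feedback?: string;   // optional feedback sentence to present to the user
}

async function handleVoiceQuery(audio: Blob): Promise<string> {
  // [Speech recognition] upload the (denoised) audio, get back plain text
  const text: string = await fetch("https://asr.example.com/recognize", {
    method: "POST",
    body: audio,
  }).then(r => r.text());

  // [Semantic understanding] parse the text into business domain and intent
  const semantic: SemanticResult = await fetch("https://nlu.example.com/parse", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text }),
  }).then(r => r.json());

  // [Semantic response] query the business service selected by the parse result
  const result: string = await fetch(
    `https://svc.example.com/${semantic.domain}?intent=${encodeURIComponent(semantic.intent)}`
  ).then(r => r.text());

  return semantic.feedback ?? result;
}
```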
It should be noted that the architecture shown in fig. 5 is only an example and is not intended to limit the scope of the present application. Other architectures may also be used to achieve similar functions in embodiments of the present application; for example, all or part of the above three stages may be completed by the smart terminal itself, which is not described in detail herein.
In some embodiments, the smart device shown in fig. 5 may be a display device, such as a smart tv, and the functions of the voice recognition service device may be implemented by a sound collector and a controller disposed on the display device in cooperation, and the functions of the semantic service device and the business service device may be implemented by a controller of the display device, or implemented by a server of the display device.
With the development of voice interaction technology, more and more household terminal devices have voice interaction functions. By utilizing the voice interaction function, the user can control the terminal devices to execute corresponding operations, such as starting, stopping and the like, through voice.
With the development of smart television products, users can browse web pages in the browser embedded in a smart television. However, because of the limitations of the device, browsing a web page on a smart television requires using the remote controller to simulate mouse movement, which makes the interaction quite cumbersome. The development of voice interaction technology now makes it possible to control the browser on a smart television by voice.
However, web page elements often have very similar content, making it difficult for the user to distinguish them. As a result, the user cannot quickly and accurately express which web page element to interact with, and the interaction experience between the user and the display device is poor.
In order to solve the above problems, the present application provides a display apparatus, in which a user can input a control instruction using a control device or input a control instruction by voice.
After the user turns on the display device, a user interface is displayed on the display; this user interface may be a pre-built web page interface. The web page interface contains web page elements for distinguishing different contents. Each web page element contains a keyword that maps the web page element. Therefore, the user can quickly and accurately express the web page element to interact with according to the keyword.
The display device further comprises a sound collector for collecting voice signals input by the user. After the sound collector collects the voice signal input by the user, it sends the voice signal to the controller. The controller receives the voice signal input by the user and recognizes a voice text from the voice signal. Recognizing voice text from a voice signal is prior art and is not described in detail in this application.
The controller further judges, according to the recognized voice text, whether a keyword matching the voice text exists in the current user interface. If a keyword matching the voice text exists in the current user interface, a click operation is performed on the web page element mapped by that keyword. If no keyword matching the voice text exists in the current user interface, no click operation is performed on any web page element.
Illustratively, as shown in the user interface schematic of fig. 6, after the user opens the display device, a user interface is displayed on the display. The user interface includes a plurality of video elements. The user can use the remote controller to simulate mouse operation and click a video element with it, so as to perform the click operation and open the corresponding video.
Each video element contains a corresponding keyword. Video element A contains the keyword "potato". If the voice text "I want to watch the Peppa Pig potato episode" is recognized from the voice signal input by the user, the voice text matches the keyword "potato" contained in video element A, and a click operation is performed on video element A. Video element B contains the keyword "Peppa Pig". If the voice text "play Peppa Pig" is recognized from the voice signal input by the user, the voice text matches the keyword "Peppa Pig" contained in video element B, and a click operation is performed on video element B.
It should be noted that the matching between the voice text and the keyword contained in a web page element may be that the voice text completely coincides with the keyword, that the voice text contains the keyword, or that the keyword contains the voice text. The specific matching rule can be set according to the actual situation, and the present application does not limit the rule for matching the voice text with the keyword.
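As a minimal sketch of these matching rules (exact coincidence, voice text containing the keyword, or keyword containing the voice text), the helper below shows one possible formulation; the function name and the whitespace trimming are assumptions for illustration.

```typescript
// Returns true if the recognized voice text matches the element's keyword
// under any of the three rules described above.
function matchesKeyword(voiceText: string, keyword: string): boolean {
  const text = voiceText.trim();
  const key = keyword.trim();
  return text === key || text.includes(key) || key.includes(text);
}

// Example: "play Peppa Pig" matches the keyword "Peppa Pig" because the
// voice text contains the keyword.
```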
If no keyword matching the voice text exists in the current user interface, no click operation is performed. For example, the voice text "I want to see a dragon fight" may be recognized from the voice signal input by the user, but if no web page element in the current user interface matches this voice text, no click operation is performed. At this time, a prompt dialog box may be displayed in the interface to prompt that no web page element associated with the voice text exists in the current user interface.
In some embodiments, each web page element contains element text, the keyword in the above embodiments is contained in the element text, and the keyword is highlighted in the element text. The highlighting may be done by highlighting, magnifying, bolding, or displaying the keyword in a color different from the other text, thereby distinguishing the keyword from the other text in the element text. For example, in the user interface shown in fig. 6, video element A contains the element text "Peppa Pig Season 5: Potato City", the keyword "potato" is contained in the element text, and the keyword "potato" is highlighted within it. In this way, each web page element can be quickly distinguished, and the user can quickly and accurately express the web page element to interact with.
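One possible way to highlight the keyword inside the element text is to wrap it in a styled span, as sketched below; the specific styling (bold plus a distinct color) is an illustrative assumption, since this application only requires that the keyword stand out from the other text.

```typescript
// Highlight the first occurrence of the keyword within an element's text by
// wrapping it in a styled span. Assumes the element contains plain text.
function highlightKeyword(el: HTMLElement, keyword: string): void {
  const text = el.textContent ?? "";
  const idx = text.indexOf(keyword);
  if (idx < 0) return; // keyword not present, leave the element unchanged
  el.innerHTML =
    text.slice(0, idx) +
    `<span style="font-weight:bold;color:#ff6600">${keyword}</span>` +
    text.slice(idx + keyword.length);
}
```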
In some embodiments, the specific steps of determining keywords from the element text are:
and performing word segmentation processing on the element text according to different language types to generate at least one word segmentation word. And then sequencing the word segmentation words according to the sequence from short word segmentation words to long word segmentation words to obtain a word list. And determining the word segmentation words positioned at the first position in the vocabulary list as keywords.
For example, in the user interface shown in fig. 6, video element A contains the element text "Peppa Pig Season 5: Potato City". After the element text "Peppa Pig Season 5: Potato City" is extracted, text processing is performed; the processing mainly includes word segmentation, part-of-speech tagging, sentiment analysis, and the like. The text processing may use an open-source word segmentation tool such as snownlp, jieba, or stanfordCoreNLP (the three tools listed are Chinese-language processing libraries written in Python). The choice of word segmentation method is not limited here, and the specific word segmentation process is prior art and is not repeated. After word segmentation, the word segmentation words are obtained: Peppa Pig, Season 5, potato, city. The word segmentation words are sorted from shortest to longest by length to obtain the vocabulary list, in which the order of the words is: potato, city, Season 5, Peppa Pig. The word "potato", which is in the first position, is finally determined as the keyword. In fig. 6, the keyword of video element A is thus determined as "potato", and video element A can be clearly distinguished from the other video elements.
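A minimal sketch of this keyword-selection step is shown below. Because the example above relies on Python word segmentation tools such as jieba, the browser-native Intl.Segmenter is used here purely as a stand-in segmenter; that substitution, and the helper name, are assumptions rather than the implementation described in this application.

```typescript
// Segment the element text into words, sort from shortest to longest, and
// return the word in the first position of the resulting vocabulary list.
function shortestWord(elementText: string, locale = "zh"): string | undefined {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const words = Array.from(segmenter.segment(elementText))
    .filter(s => s.isWordLike)   // drop punctuation such as ":"
    .map(s => s.segment);
  words.sort((a, b) => a.length - b.length); // vocabulary list, short to long
  return words[0];                            // the word in the first position
}
```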
In some embodiments, if the word segmentation word in the first position of the vocabulary list obtained according to the above embodiments has already been used as the keyword of another web page element, the word segmentation word in the second position of the vocabulary list is determined as the keyword. If the word in the second position has also been used as the keyword of another web page element, the word in the next position is taken as the candidate, and so on, until the determined keyword does not repeat the keywords of the other web page elements. For example, in fig. 6, after the keyword "potato" of video element A has been determined, when the keyword of video element C is determined, the word in the first position is also "potato", so the word "planet" in the second position is determined as the keyword of video element C. The keyword of video element C is thereby distinguished from the keyword of video element A.
In some embodiments, if all of the word segmentation words in the vocabulary list are already keywords of other web page elements, the word segmentation words in the vocabulary list are combined to generate a combined word, and the combined word is determined as the keyword of the web page element. Alternatively, a number is inserted into a word segmentation word to generate a numbered word, and the numbered word is determined as the keyword of the web page element. The final aim is to make the keywords of the web page elements of the current user interface different from each other, so that the user can quickly distinguish the web page elements.
For example, in the user interface shown in fig. 6, the element text of video element F is also "Peppa Pig Season 5: Potato City", but all the words in the vocabulary list (Peppa Pig, Season 5, potato, city) have already been used by other video elements, so the words in the vocabulary list may be merged as the final keyword. For example, combining "Peppa Pig" and "Season 5" results in the combined word "Peppa Pig Season 5", which distinguishes video element F from the other video elements. Alternatively, a number may be inserted into "Peppa Pig", for example the number 1, resulting in the numbered word "Peppa Pig 1", which can also distinguish video element F from the other video elements. In some embodiments, a word in the vocabulary list whose part of speech is a noun may also be preferred as the keyword, which is more helpful for distinguishing the individual web page elements. The fallback order is sketched after this paragraph.
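The fallback order described above (first unused word in the vocabulary list, then a combined word, then a word with an inserted number) can be sketched as follows; the function and parameter names are illustrative assumptions.

```typescript
// Pick a keyword for one web page element that no other element in the current
// viewport already uses. `vocabulary` is the element's word list sorted from
// shortest to longest; `used` holds keywords already assigned to other elements.
function pickUniqueKeyword(vocabulary: string[], used: Set<string>): string {
  // 1. First word in the short-to-long list that is not yet used
  for (const word of vocabulary) {
    if (!used.has(word)) {
      used.add(word);
      return word;
    }
  }
  // 2. All single words are taken: merge adjacent words into a combined word
  for (let i = 0; i + 1 < vocabulary.length; i++) {
    const merged = vocabulary[i] + vocabulary[i + 1];
    if (!used.has(merged)) {
      used.add(merged);
      return merged;
    }
  }
  // 3. Still no unique keyword: insert a number into the first word
  for (let n = 1; ; n++) {
    const numbered = vocabulary[0] + n;
    if (!used.has(numbered)) {
      used.add(numbered);
      return numbered;
    }
  }
}
```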
In some embodiments, the current web page element and the other web page elements in the above embodiments are all within the web page viewport. That is, the current web page element and the other web page elements are all within the region currently visible to the user, without the user scrolling the page. Because a web page is typically large, the user may need to scroll up and down to view the content of the entire page. When the keywords of web page elements are determined, only the web page elements in the visible region are subjected to the de-duplication process. For example, in the user interface shown in fig. 6, video elements A through F are in the region currently visible to the user, while video elements G through I are not; that is, video elements G through I enter the visible region only after the user scrolls the page. Therefore, when determining the keywords of video elements A through F, only whether the keywords of these six elements repeat needs to be considered, and the keywords of video elements G through I need not be considered.
If the user scrolls the page so that video elements A through C move out of the currently visible region, video elements D through F remain in the visible region, and video elements G through I move into the visible region, then the keywords of video elements D through I may be redetermined from their element text. Alternatively, the keywords of video elements D through F remain unchanged, and the keywords of video elements G through I are determined with respect to the keywords of video elements D through F, without considering the keywords of video elements A through C. In either case, the keywords of video elements D through I within the currently visible region do not repeat.
The specific implementation manner of the above embodiment is as follows:
and extracting text content of a webpage element which can respond to the click event in the current webpage viewport, establishing a mapping relation between the text content and DOM (Document Object Model ) element nodes, and storing the mapping relation. The desired click event responsive web page element may be looked up by a CSS (Cascading Style Sheets, cascading style sheet) selector. The web page elements capable of responding to the click time are typically button tags, a tags, tags with the button in div, tags with the button in input type (the tags are all tags in HTML, the button tag has the meaning of a button, and the a tag has the meaning of a hyperlink), etc. The size of each element and its position relative to the viewport is then returned by the element. Thereby determining whether the web page element is within the viewport region. And then extracting text content in the webpage elements, extracting keywords from the text content, and simultaneously establishing a mapping relation between the keywords and the webpage elements. After the user inputs the voice text matched with the keywords, the clicking event can be simulated, and the corresponding webpage element is clicked.
In some embodiments, after the keywords in the web page are determined, a keyword set may be established locally, the keyword set containing the keywords of the web page elements that have already been determined. When the keyword of the next web page element is determined, the keyword set is traversed to judge whether a repeated keyword exists; if so, the current candidate word is discarded and the next word in the vocabulary list is judged, until the current word does not repeat any keyword in the keyword set, at which point it is determined as the keyword of the current web page element. The text content may also be sent to a cloud server to generate unique keywords. The cloud server may establish a dictionary library according to the specific scene. Weights may be set for the words in the dictionary library; for example, a high weight may be set for words with a high speech recognition rate and a low weight for words with a low speech recognition rate. In this way, words with a high speech recognition rate are more likely to be selected when the keywords of web page elements are determined, which further improves the user's interaction experience.
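Assuming a dictionary of per-word weights as described above, the preference for words with a high speech recognition rate could be sketched as follows; the function name and the shape of the weight table are assumptions.

```typescript
// Among the candidate words that are still unused, prefer the one with the
// highest weight (e.g. the best speech recognition rate). Assumes `candidates`
// is non-empty; words missing from the table default to weight 0.
function pickWeighted(candidates: string[], weights: Map<string, number>): string {
  return candidates.reduce((best, word) =>
    (weights.get(word) ?? 0) > (weights.get(best) ?? 0) ? word : best
  );
}
```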
An embodiment of the present application provides a voice interaction method for a display device. As shown in the flowchart of fig. 7, the method includes the following steps:
step one, a sound collector of the display device receives a voice signal input by a user and recognizes a voice text from the voice signal.
And step two, if keywords matched with the voice text exist in the user interface, clicking operation is carried out on the webpage elements mapped by the keywords matched with the voice text.
And step three, if no keyword matched with the voice text exists in the user interface, not executing clicking operation on the webpage element.
The user interface contains web page elements for distinguishing different contents, and the web page elements contain keywords that map the web page elements. By looking at the keywords contained in the web page elements, the user can quickly and accurately express the web page element to interact with, which improves the interaction experience between the user and the display device.
In some embodiments, the web page elements contain element text, the element text contains the keywords, and the keywords are highlighted in the element text. For example, a keyword may be highlighted or displayed in a color different from the other text content.
In some embodiments, the steps of determining a keyword from the element text are: word segmentation processing is performed on the element text to generate at least one word segmentation word. All the word segmentation words are sorted from shortest to longest to generate a vocabulary list. Finally, the word in the first position is selected as the keyword, that is, the shortest word is selected. A noun among the word segmentation words may also be selected directly as a keyword. A web page element may have a plurality of keywords, and the user may choose a preferred keyword when interacting.
The same or similar content may be referred to each other in various embodiments of the present application, and the related embodiments are not described in detail.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present application, not to limit it. Although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that the technical solutions described in the foregoing embodiments can still be modified, or some or all of their technical features can be replaced by equivalents; such modifications and substitutions do not depart from the scope of the technical solutions of the embodiments of the present application.
The foregoing description, for purposes of explanation, has been presented in conjunction with specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the embodiments to the precise forms disclosed above. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles and the practical application, to thereby enable others skilled in the art to best utilize the embodiments and various embodiments with various modifications as are suited to the particular use contemplated.

Claims (8)

1. A display device, characterized by comprising:
a display for displaying a user interface and web page elements for distinguishing different contents in the user interface, wherein the web page elements comprise element texts, the element texts comprise keywords, the web page elements comprise keywords for mapping the web page elements, and the keywords are highlighted in the element texts;
the sound collector is used for collecting voice signals input by a user;
a controller for performing:
receiving a voice signal input by a user, and recognizing a voice text from the voice signal;
executing clicking operation on the webpage element mapped by the keyword matched with the voice text when the keyword matched with the voice text exists in the user interface;
when the keyword matched with the voice text does not exist in the user interface, not executing clicking operation on the webpage element;
the matching includes that the voice text and the keyword completely overlap, the voice text includes the keyword, and the keyword includes the voice text.
2. The display device of claim 1, wherein the specific step of determining the keyword from the element text is:
performing word segmentation processing on the element text to generate at least one word segmentation word;
sorting the word segmentation words in order from shortest to longest according to their length, and generating a vocabulary list;
and determining the word segmentation words positioned at the first position in the vocabulary list as the keywords.
3. The display device of claim 2, wherein the specific step of determining the keyword from the element text further comprises:
and when the word segmentation words positioned at the first position in the vocabulary list are the keywords of other webpage elements, determining the word segmentation words positioned at the second position in the vocabulary list as the keywords.
4. A display device as recited in claim 3, wherein the specific step of determining the keywords from the element text further comprises:
when all the word segmentation words in the vocabulary list are the keywords of other webpage elements, combining the word segmentation words in the vocabulary list to generate combined words, and determining the combined words as the keywords;
or inserting numbers in the word segmentation words, generating inserted number words, and determining the inserted number words as the keywords.
5. A display device as claimed in claim 3, wherein the other web page elements and the current web page element are both within a web page viewport.
6. The display device of claim 1, wherein the specific step of determining the keyword from the element text is:
performing word segmentation processing on the element text to obtain at least one word segmentation word;
and when the word segmentation words with the part of speech as the nouns exist in the word segmentation words, determining the word segmentation words with the part of speech as the nouns as the keywords.
7. A voice interaction method of a display device, the method being applied to a controller of the display device, characterized in that a display of the display device is used for displaying a user interface, and web page elements for distinguishing different contents in the user interface, wherein the web page elements contain element texts, the element texts contain keywords, the web page elements contain keywords mapping the web page elements, and the keywords are highlighted in the element texts; the method comprises the following steps:
receiving a voice signal input by a user, and recognizing a voice text from the voice signal;
executing clicking operation on the webpage element mapped by the keyword matched with the voice text when the keyword matched with the voice text exists in the user interface;
when the keyword matched with the voice text does not exist in the user interface, not executing clicking operation on the webpage element;
the matching includes that the voice text and the keyword completely overlap, the voice text includes the keyword, and the keyword includes the voice text.
8. The voice interaction method of a display device according to claim 7, wherein the specific step of determining the keyword from the element text is:
performing word segmentation processing on the element text to generate at least one word segmentation word;
sorting the word segmentation words in order from shortest to longest according to their length, and generating a vocabulary list; and determining the word segmentation word positioned at the first position in the vocabulary list as the keyword.
CN202110922915.8A 2021-08-12 2021-08-12 Voice interaction method of display equipment and display equipment Active CN113658598B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110922915.8A CN113658598B (en) 2021-08-12 2021-08-12 Voice interaction method of display equipment and display equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110922915.8A CN113658598B (en) 2021-08-12 2021-08-12 Voice interaction method of display equipment and display equipment

Publications (2)

Publication Number Publication Date
CN113658598A CN113658598A (en) 2021-11-16
CN113658598B true CN113658598B (en) 2024-02-27

Family

ID=78491497

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110922915.8A Active CN113658598B (en) 2021-08-12 2021-08-12 Voice interaction method of display equipment and display equipment

Country Status (1)

Country Link
CN (1) CN113658598B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20050040983A (en) * 2003-10-29 2005-05-04 (주)텔리뷰 Voice web browsing system and control method thereof
EP2899719A1 (en) * 2014-01-27 2015-07-29 Samsung Electronics Co., Ltd Display apparatus for performing voice control and voice controlling method thereof
CN107145509A (en) * 2017-03-28 2017-09-08 深圳市元征科技股份有限公司 A kind of information search method and its equipment
WO2017156893A1 (en) * 2016-03-18 2017-09-21 深圳Tcl数字技术有限公司 Voice control method and smart television
WO2018141144A1 (en) * 2017-02-06 2018-08-09 华为技术有限公司 Method for use in processing text and voice information, and terminal
CN109410935A (en) * 2018-11-01 2019-03-01 平安科技(深圳)有限公司 A kind of destination searching method and device based on speech recognition
CN112100357A (en) * 2020-09-24 2020-12-18 腾讯科技(深圳)有限公司 Method and device for generating guide language, electronic equipment and computer storage medium
CN112689177A (en) * 2021-01-14 2021-04-20 海信电子科技(深圳)有限公司 Method for realizing rapid interaction and display equipment
CN112839261A (en) * 2021-01-14 2021-05-25 海信电子科技(深圳)有限公司 Method for improving voice instruction matching degree and display equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140350928A1 (en) * 2013-05-21 2014-11-27 Microsoft Corporation Method For Finding Elements In A Webpage Suitable For Use In A Voice User Interface

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20050040983A (en) * 2003-10-29 2005-05-04 (주)텔리뷰 Voice web browsing system and control method thereof
EP2899719A1 (en) * 2014-01-27 2015-07-29 Samsung Electronics Co., Ltd Display apparatus for performing voice control and voice controlling method thereof
WO2017156893A1 (en) * 2016-03-18 2017-09-21 深圳Tcl数字技术有限公司 Voice control method and smart television
WO2018141144A1 (en) * 2017-02-06 2018-08-09 华为技术有限公司 Method for use in processing text and voice information, and terminal
CN107145509A (en) * 2017-03-28 2017-09-08 深圳市元征科技股份有限公司 A kind of information search method and its equipment
CN109410935A (en) * 2018-11-01 2019-03-01 平安科技(深圳)有限公司 A kind of destination searching method and device based on speech recognition
CN112100357A (en) * 2020-09-24 2020-12-18 腾讯科技(深圳)有限公司 Method and device for generating guide language, electronic equipment and computer storage medium
CN112689177A (en) * 2021-01-14 2021-04-20 海信电子科技(深圳)有限公司 Method for realizing rapid interaction and display equipment
CN112839261A (en) * 2021-01-14 2021-05-25 海信电子科技(深圳)有限公司 Method for improving voice instruction matching degree and display equipment

Also Published As

Publication number Publication date
CN113658598A (en) 2021-11-16

Similar Documents

Publication Publication Date Title
CN112839261B (en) Method for improving matching degree of voice instruction and display equipment
CN112000820A (en) Media asset recommendation method and display device
CN112511882A (en) Display device and voice call-up method
CN112885354B (en) Display device, server and display control method based on voice
CN112182196A (en) Service equipment applied to multi-turn conversation and multi-turn conversation method
CN112188249B (en) Electronic specification-based playing method and display device
CN112804567A (en) Display device, server and video recommendation method
WO2022100283A1 (en) Display device, control triggering method and scrolling text detection method
CN113111214A (en) Display method and display equipment for playing records
CN113658598B (en) Voice interaction method of display equipment and display equipment
CN112689177B (en) Method for realizing quick interaction and display equipment
CN113490057B (en) Display device and media asset recommendation method
CN111950288B (en) Entity labeling method in named entity recognition and intelligent device
CN115150673B (en) Display equipment and media asset display method
CN113038217A (en) Display device, server and response language generation method
CN112601116A (en) Display device and content display method
CN112667285A (en) Application upgrading method, display device and server
CN113035194B (en) Voice control method, display device and server
CN113794915B (en) Server, display device, poetry and singing generation method and medium play method
CN114442849B (en) Display equipment and display method
CN115174997B (en) Display device and media asset recommendation method
CN112883302B (en) Method for displaying page corresponding to hyperlink address and display equipment
CN117573997A (en) Display device and webpage translation method
CN113849664A (en) Display device, server and media asset searching method
CN118210964A (en) Search performance upgrading and degrading method and display device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20221020

Address after: 83 Intekte Street, Devon, Netherlands

Applicant after: VIDAA (Netherlands) International Holdings Ltd.

Address before: 9 / F, Hisense south building, 1777 Chuangye Road, Nanshan District, Shenzhen, Guangdong 518000

Applicant before: HISENSE ELECTRONIC TECHNOLOGY (SHENZHEN) Co.,Ltd.

GR01 Patent grant
GR01 Patent grant