CN111131889B - Method and system for adaptively adjusting images and sounds in scene and readable storage medium - Google Patents

Method and system for adaptively adjusting images and sounds in scene and readable storage medium

Info

Publication number
CN111131889B
Authority
CN
China
Prior art keywords
scene
deep learning
learning model
recognition result
current display
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911424251.1A
Other languages
Chinese (zh)
Other versions
CN111131889A (en)
Inventor
付华东
王余生
许福
王鵾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Skyworth RGB Electronics Co Ltd
Original Assignee
Shenzhen Skyworth RGB Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Skyworth RGB Electronics Co Ltd filed Critical Shenzhen Skyworth RGB Electronics Co Ltd
Priority to CN201911424251.1A priority Critical patent/CN111131889B/en
Publication of CN111131889A publication Critical patent/CN111131889A/en
Application granted granted Critical
Publication of CN111131889B publication Critical patent/CN111131889B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/485End-user interface for client configuration
    • H04N21/4852End-user interface for client configuration for modifying audio parameters, e.g. switching between mono and stereo
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/485End-user interface for client configuration
    • H04N21/4854End-user interface for client configuration for modifying image parameters, e.g. image brightness, contrast

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a method, a system and a computer-readable storage medium for adaptively adjusting images and sounds in a scene, wherein the method comprises the following steps: acquiring a current display picture, and recognizing the current display picture through a trained deep learning model to obtain a scene recognition result matched with the current display picture; and sending the scene recognition result to the video source terminal mapped by the current display picture, so that the video source terminal adjusts image parameters and sound parameters according to the scene recognition result. By recognizing the current display picture and sending the recognition result to the video source terminal, even a device without a scene recognition function can perform optimal image and sound adjustment for different scenes.

Description

Method and system for adaptively adjusting images and sounds in scene and readable storage medium
Technical Field
The present invention relates to the field of terminals, and in particular, to a method, a system, and a computer-readable storage medium for adaptively adjusting images and sounds in a scene.
Background
Current television content places increasing emphasis on the audio-visual experience, and in order to bring a better audio-visual experience to users, a television's image and sound parameters need to be tuned before it leaves the factory. Because the video and audio characteristics of different content differ, it is difficult for a single set of fixed image and sound parameters to produce a good audio-visual effect for all played content. To solve this problem, the prior art recognizes the currently played picture so as to perform the optimal image and sound adjustment for different playing scenes. However, a device without a scene recognition function cannot recognize the picture being played, and therefore cannot perform the optimal image and sound adjustment for different playing scenes.
Disclosure of Invention
The main object of the present invention is to provide a method, a system and a computer-readable storage medium for adaptively adjusting images and sounds in a scene, aiming to solve the problems in the prior art that scene recognition is slow, recognition accuracy is low, and a device without a scene recognition function cannot perform optimal image and sound adjustment for different playing scenes.
In order to achieve the above object, the present invention provides a method for adaptively adjusting images and sounds in a scene, the method comprising the steps of:
acquiring a current display picture, and identifying the current display picture through a trained deep learning model to obtain a scene identification result matched with the current display picture;
and sending the scene recognition result to the video source terminal mapped by the current display picture so that the video source terminal adjusts the image parameter and the sound parameter according to the scene recognition result.
Optionally, before the step of acquiring the current display screen, the method further includes:
collecting sample images of different classes from a sample library;
and training the deep learning model according to the sample images of different categories to obtain the trained deep learning model.
Optionally, the step of training the deep learning model according to the different classes of sample images includes:
acquiring corresponding scene label data according to the type of the sample image;
and taking the sample image as the input of the deep learning model, outputting predicted scene label data after the preset deep learning model is operated, and training the deep learning model according to the predicted scene label data and the scene label data corresponding to the sample image to obtain the trained deep learning model.
Optionally, the step of using the sample image as an input of the deep learning model, outputting predicted scene tag data after the preset deep learning model is run, and training the deep learning model according to the predicted scene tag data and the scene tag data corresponding to the sample image to obtain a trained deep learning model includes:
inputting a sample image into a deep learning model, so that the deep learning model outputs predicted scene label data, and adding 1 to the accumulated training times;
comparing scene label data corresponding to the sample image with the predicted scene label data to obtain a loss function;
adjusting parameters of a deep learning model according to the loss function so as to update the deep learning model;
judging whether the accumulated training times reach a preset training threshold value or not;
stopping training when the accumulated training times reach a preset training threshold value, and taking the deep learning model reaching the preset training threshold value as a deep learning model after training;
and when the accumulated training times do not reach a preset training threshold value, acquiring a new sample image and executing the step of inputting the sample image into the deep learning model.
Optionally, the step of recognizing the current display screen through the trained deep learning model to obtain a scene recognition result matched with the current display screen includes:
and inputting the current display picture into a trained deep learning model to obtain a corresponding scene recognition result.
Optionally, the step of sending the scene recognition result to the video source terminal mapped by the current display frame so that the video source terminal adjusts the image parameter and the sound parameter according to the scene recognition result includes:
and sending the acquired scene recognition result to a video source terminal so that the video source terminal can acquire the image parameters and the sound parameters of the current display picture and the adjusting parameters of the current display picture in a preset mapping relation table corresponding to the scene recognition result so as to adjust the image parameters and the sound parameters according to the adjusting parameters.
Optionally, the step of sending the scene recognition result to the video source terminal mapped by the current display frame, so that the video source terminal adjusts the image parameter and the sound parameter according to the scene recognition result, further includes:
and sending the scene recognition result to audio output equipment so that the audio output equipment adjusts the sound parameters according to the scene recognition result.
Optionally, the step of sending the scene recognition result to an audio output device so that the audio output device adjusts the sound parameter according to the scene recognition result includes:
and sending the obtained scene recognition result to audio output equipment so that the audio output equipment obtains the current sound parameter and the adjusting parameter corresponding to the scene recognition result in a preset audio mapping relation table corresponding to the scene recognition result to adjust the sound parameter according to the adjusting parameter.
To achieve the above object, the present invention further provides a system including a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the method for scene adaptive adjustment of images and sounds as described above.
To achieve the above object, the present invention further provides a computer-readable storage medium, having a computer program stored thereon, where the computer program is executed by a processor to implement the steps of the method for adaptively adjusting images and sounds in a scene as described above.
According to the method, the system and the computer readable storage medium for adaptively adjusting the images and the sounds in the scene, the current display picture is obtained, and the trained deep learning model is used for identifying the current display picture, so that a scene identification result matched with the current display picture is obtained; and sending the scene recognition result to the video source terminal mapped by the current display picture so that the video source terminal adjusts the image parameter and the sound parameter according to the scene recognition result. By identifying the current display picture and sending the identification result to the video source terminal, even equipment without a scene identification function can perform optimal image and sound adjustment according to different scenes, so that the equipment can obtain good audio-visual effect under different playing pictures.
Drawings
FIG. 1 is a flowchart illustrating a first embodiment of the method for adaptively adjusting images and sounds in a scene according to the present invention;
FIG. 2 is a flowchart illustrating a second embodiment of the method for adaptively adjusting images and sounds in a scene according to the present invention;
FIG. 3 is a flowchart illustrating step S40 of the second embodiment of the method for adaptively adjusting images and sounds in a scene according to the present invention;
FIG. 4 is a schematic block diagram of the system of the present invention.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a method for adaptively adjusting images and sounds in a scene, referring to fig. 1, fig. 1 is a schematic flow chart of a first embodiment of the method for adaptively adjusting images and sounds in a scene according to the invention, and the method includes the steps of:
step S10, acquiring a current display picture, and recognizing the current display picture through a trained deep learning model to obtain a scene recognition result matched with the current display picture;
the deep learning model in this scheme may be a multi-class neural network model, which is illustrated in this embodiment as mobilene-v 1, where mobilene-v 1 is a model that constructs a lightweight weight deep neural network using deep separable convolution based on a streamlined structure. The MobileNet is further deeply researched to design the MobileNet after a depthwise partial constants using method, and the essence of the depthwise partial constants is sparse expression with less redundant information. Two choices of efficient model design are provided on the basis: width factor (width multiplexer) and resolution factor (resolutionmultiplier); by balancing size, delay time and precision, a smaller-scale and faster-speed MobileNet is constructed. The Google team also demonstrated the effectiveness of MobileNet as an efficient infrastructure network through diverse experiments. MobileNet uses a convolution mode called deep-wise to replace the original traditional 3D convolution, and reduces redundant expression of convolution kernels. After the calculation amount and the parameter number are obviously reduced, the convolution network can be applied to more mobile terminal platforms.
MobileNet achieves a very good balance between computational load, memory footprint and accuracy. Compared with VGG16, the amount of computation is reduced by a factor of about 30 with only a small loss of accuracy. MobileNet therefore plays a significant role in on-device intelligent applications with strict requirements on real-time performance, storage space and energy consumption, such as self-driving cars, robots and drones. MobileNet can also be ported to Android and iOS, which gives it important application value.
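For illustration, the parameter saving of depthwise separable convolution can be sketched in a few lines of Keras code; this sketch is not part of the original disclosure, and the layer sizes are arbitrary examples.

```python
# Minimal sketch (not from the patent): a standard 3x3 convolution versus the
# depthwise-separable form used by MobileNet-v1, built with tf.keras layers.
import tensorflow as tf

inputs = tf.keras.Input(shape=(224, 224, 3))

# Standard 3x3 convolution producing 32 channels: 3*3*3*32 = 864 weights.
standard = tf.keras.layers.Conv2D(32, 3, padding="same")(inputs)

# Depthwise separable form: a 3x3 depthwise convolution (3*3*3 = 27 weights)
# followed by a 1x1 pointwise convolution (1*1*3*32 = 96 weights).
depthwise = tf.keras.layers.DepthwiseConv2D(3, padding="same")(inputs)
separable = tf.keras.layers.Conv2D(32, 1, padding="same")(depthwise)

model = tf.keras.Model(inputs, [standard, separable])
model.summary()  # the separable branch carries several times fewer weights
```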
The system mainly comprises a data acquisition module, a model training module, a model conversion module, a scene recognition module and an image and sound adjustment module.
The current display picture refers to the picture displayed on the current television; the picture may be projected or mapped to the television by another video source terminal, or may be a picture played directly by the television itself. The function of recognizing the current display picture through the trained deep learning model can be implemented with a server and a client: the server mainly compiles the runtime library files and trains the model, while the client captures a screenshot of the current display picture, calls the trained model to recognize the captured picture, and sends the recognition result to the video source terminal. The scene recognition result covers playing scenes with particular picture-quality requirements, such as a ball game scene, a movie scene or a game scene. It should be noted that, because this embodiment is described based on the MobileNet-v1 model, after the current display picture is obtained it may be scaled to an RGB picture with data dimensions of 224 × 224 × 3, that is, an RGB picture whose length and width are both 224 pixels. If the size is too small, too much information is lost; if the size is too large, the abstraction level of the information is not high enough and the amount of computation is larger. The size can therefore be set according to actual needs.
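A minimal Python sketch of this preprocessing step is given below for illustration; it is not part of the original disclosure, and the file name and helper name are hypothetical examples.

```python
# Sketch: scale a captured frame to the 224x224x3 RGB input expected by
# the MobileNet-v1 model described above.
import numpy as np
from PIL import Image

def preprocess_frame(path, size=224):
    frame = Image.open(path).convert("RGB")       # force a 3-channel RGB picture
    frame = frame.resize((size, size))            # length and width of 224 pixels
    data = np.asarray(frame, dtype=np.float32) / 255.0
    return np.expand_dims(data, axis=0)           # shape (1, 224, 224, 3)

# Example (hypothetical file name): batch = preprocess_frame("current_frame.png")
```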
Recognizing the current display picture through the model may consist of inputting the current display picture into the trained deep learning model to obtain the corresponding scene recognition result. Specifically, the feature value of the current display picture can be obtained by abstracting the picture; this feature value is compared with the positive feature value and the negative feature value of each class of scene pictures, the class with the highest similarity is taken as the prediction result, and the corresponding scene recognition result is output. Recognizing the current display picture through the trained deep learning model allows the picture to be recognized accurately and the corresponding scene recognition result to be obtained.
And step S20, sending the scene recognition result to the video source terminal mapped by the current display picture so that the video source terminal can adjust the image parameter and the sound parameter according to the scene recognition result.
After the video source terminal receives the scene recognition result corresponding to the current display picture sent by the television, it adjusts the image parameters and sound parameters of the currently played video according to the scene recognition result. It can be understood that the adjusted image parameters and sound parameters can be applied directly at the video source terminal, so that the playing picture and sound it transmits to the television conform to the adjusted image parameters and sound parameters.
After the video source terminal obtains the adjusting parameters of the image parameters and the sound parameters, the video source terminal can also send the adjusting parameters to the television, so that the television correspondingly adjusts the image parameters of the current scene and the sound parameters of the played sound according to the adjusting parameters.
Further, after the scene recognition result matched with the current display picture is obtained, the image parameters of the current display picture and the sound parameters of the playing sound can be directly and correspondingly adjusted through the current television according to the adjusting parameters.
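The disclosure does not specify the transport used to deliver the scene recognition result to the video source terminal; the following Python sketch merely assumes a small JSON message over TCP, with the host, port and message fields as hypothetical examples.

```python
# Illustrative sketch only: push the scene recognition result from the
# television to the video source terminal as a JSON message over TCP.
import json
import socket

def send_scene_result(scene_label, host="192.168.1.50", port=9000):
    message = json.dumps({"type": "scene_recognition", "scene": scene_label})
    with socket.create_connection((host, port), timeout=2.0) as conn:
        conn.sendall(message.encode("utf-8"))

# send_scene_result("football")  # the source terminal then looks up its mapping table
```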
In this embodiment, the television is not only the large-screen device of an AIoT (Artificial Intelligence & Internet of Things) smart home, but also the control center of the whole smart home, and it can be interconnected with all other devices through Bluetooth and the network. Currently, most commonly used smart devices, such as speakers, air conditioners, refrigerators, washing machines, computers and kitchen appliances, do not have AI (Artificial Intelligence) functions such as scene recognition, and other smart devices, such as smart phones and tablet computers, may also lack such functions. The invention enables smart-home devices to gain AI functions such as scene recognition without extra cost even though the devices themselves do not have them, saves the artificial-intelligence cost of smart-home devices, and brings users a better audio-visual feast.
In the embodiment, a scene recognition result matched with a current display picture is obtained by obtaining the current display picture and recognizing the current display picture through a trained deep learning model; and sending the scene recognition result to the video source terminal mapped by the current display picture so that the video source terminal adjusts the image parameter and the sound parameter according to the scene recognition result. By identifying the current display picture and sending the identification result to the video source terminal, even the equipment without the scene identification function can also carry out optimal image and sound adjustment according to different scenes, so that the equipment can obtain good audio-visual effect under different playing scenes.
Further, referring to fig. 2, in the second embodiment of the method for adaptively adjusting images and sounds according to the scene of the present invention proposed based on the first embodiment of the present invention, before the step S10, the method further includes the steps of:
s30, collecting sample images of different categories from a sample library;
and S40, training the deep learning model according to the sample images of different categories to obtain the deep learning model after training.
Referring to fig. 3, the step S40 includes the steps of:
s41, acquiring corresponding scene label data according to the type of the sample image;
and S42, taking the sample image as the input of the deep learning model, outputting predicted scene label data after the preset deep learning model is operated, and training the deep learning model according to the predicted scene label data and the scene label data corresponding to the sample image to obtain the trained deep learning model.
The model training process may include: inputting a sample image into the deep learning model so that the deep learning model outputs predicted scene label data, and incrementing the cumulative training count by 1; comparing the scene label data corresponding to the sample image with the predicted scene label data to obtain a loss function; adjusting the parameters of the deep learning model according to the loss function so as to update the deep learning model; judging whether the cumulative training count reaches a preset training threshold; when the cumulative training count reaches the preset training threshold, stopping training and taking the deep learning model that reached the preset training threshold as the trained deep learning model; and when the cumulative training count does not reach the preset training threshold, acquiring a new sample image and executing the step of inputting the sample image into the deep learning model. The loss function is used to measure the degree of inconsistency between the model's predicted value and the true value, that is, in this embodiment, between the predicted scene label data and the scene label data corresponding to the sample image; it is a non-negative real-valued function.
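For illustration, the loop described above can be sketched as follows in tf.keras; the model constructor, optimizer, loss and placeholder data pipeline are assumptions made for the example and are not prescribed by the disclosure.

```python
# Sketch of the described training loop: predict, compare against the scene
# label to obtain a loss, update the model, and stop at the preset threshold.
import tensorflow as tf

model = tf.keras.applications.MobileNet(weights=None, classes=6)   # e.g. 6 scene labels
loss_fn = tf.keras.losses.CategoricalCrossentropy()
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

# Placeholder data pipeline; in practice this would read the sample library.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.uniform((8, 224, 224, 3)), tf.one_hot(tf.zeros(8, tf.int32), 6))
).batch(4).repeat()

training_count = 0
TRAINING_THRESHOLD = 4000  # preset training threshold

for sample_image, scene_label in dataset:
    with tf.GradientTape() as tape:
        predicted_label = model(sample_image, training=True)   # predicted scene label data
        loss = loss_fn(scene_label, predicted_label)            # compare with the true label
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))  # update the model
    training_count += 1                                         # cumulative training count
    if training_count >= TRAINING_THRESHOLD:                    # stop at the threshold
        break
```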
A data acquisition module: specifically, in practice, sample images of different categories are collected from a sample library to serve as the input data of the deep learning model. The process of preparing the training data may be as follows: first a sample image directory train_picture, i.e. the sample library, is created, and scene sub-directories of different types, such as football, basketball, cartoon, game, news and movie, are created under the directory; the names of the scene sub-directories correspond to the scene labels, and the sample images under each sub-directory may be obtained by capturing frames from videos of known scenes, or corresponding sample images may be obtained from a server. Further, an unknown directory is set under each scene sub-directory, and unknown sample images are stored under the unknown directory so as to improve the scene recognition accuracy. During model training, positive feature values are extracted from the sample images under a scene sub-directory and negative feature values are extracted from the unknown sample images under the unknown directory; the feature value of a test sample image is compared with the extracted positive and negative feature values, and the prediction result is determined from the class with the highest similarity. The prior art recognizes the currently played picture slowly and with low scene recognition accuracy. Therefore, in addition to training the MobileNet model on clearly classified categories, the invention provides an "unknown" training method, a method of extracting positive and negative feature values, and an added loss function. The invention thus not only shortens the scene recognition time but also improves the scene recognition accuracy.
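The disclosure only states that feature values are compared and the highest similarity wins; the sketch below assumes cosine similarity and uses invented variable names purely to illustrate the positive/negative comparison.

```python
# Sketch (assumption: cosine similarity). class_features maps each scene label
# to its positive feature vector; unknown_features holds the negative vectors.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def predict_scene(test_features, class_features, unknown_features):
    # Start from the best "unknown" (negative) score so weak matches fall through.
    best_label = "unknown"
    best_score = max((cosine(test_features, u) for u in unknown_features), default=0.0)
    for label, positive in class_features.items():
        score = cosine(test_features, positive)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score

# Toy example:
# predict_scene(np.array([0.2, 0.9]),
#               {"football": np.array([0.1, 1.0])},
#               [np.array([1.0, 0.0])])
```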
A model training module:
A retraining script is provided for the MobileNet-v1 network model; it abstracts the original pictures into feature values that are easier to classify.
The training script is located at tensorflow/examples/image_retraining/retrain.py.
Switch to the TensorFlow source-code root directory and execute the retrain.py script:
python tensorflow/examples/image_retraining/retrain.py
--image_dir /train_picture/
--how_many_training_steps 4000 --architecture mobilenet_1.0_224
The corresponding model is downloaded automatically for training according to the parameter after --architecture;
the number of training steps is set according to the parameter after --how_many_training_steps, and defaults to 4000.
It should be noted that the accuracy of the deep learning model can be improved by setting the number of training steps, that is, the preset training threshold, and by adjusting the learning rate. It should be understood that more training is not always better: accuracy peaks at a certain point, and continued training beyond that point instead results in overfitting. It should also be noted that too low a learning rate results in a slow convergence rate, while too high a learning rate causes accuracy and other problems; an appropriate learning rate should therefore be set to help ensure the accuracy of the model while maintaining an appropriate convergence rate. In the case of transfer learning, since the model has already converged on the original data, a smaller learning rate should be set.
In order to avoid high training accuracy but low scene recognition accuracy at test time, the training categories must be clearly defined. For example, to recognize a football scene, if information such as goals, grass and players is lumped together and labelled as the football scene, the features cannot be extracted correctly; finer classes such as goal and grass can be set instead. If elements common to several scenes are used to define a specific scene for training, the extracted feature values will match multiple scenes, which results in low scene recognition accuracy. Elements that differentiate scenes, rather than elements they share, should be extracted as feature values. For example, in ball-game scenes such as basketball, football and volleyball, extracting the common element of players as the feature value is unreasonable, because players appear in all of these scenes; instead, non-common differentiating elements such as grass and goals should be extracted as feature values, which improves the scene recognition accuracy.
After deep training of the MobileNet-v1 model with the retrain.py script is completed, an output_graph.pb model file and an output_labels.txt file are generated under the /tmp directory. The .pb file is the trained TensorFlow frozen-graph file, containing the model topology, the trained parameter data and the feature values of the pictures. output_labels.txt holds the label data of the pictures: the names in the output_labels.txt file are the names of the sub-directories under the train_picture directory, and no row of the label file may be changed, otherwise the index->name matching will be wrong.
A model conversion module: since most video source terminals run Android, the deep learning model trained with the retrain.py script needs to be converted into a TensorFlow Lite model. TensorFlow Lite supports low-latency, small-binary-size machine learning inference on Android, iOS and other operating systems. The command for converting the .pb model into a TensorFlow Lite model is as follows.
The conversion uses the toco command under bazel:
bazel-bin/tensorflow/contrib/lite/toco/toco
--input_file=/tmp/output_graph.pb
--input_format=TENSORFLOW_GRAPHDEF
--output_format=TFLITE
--output_file=/tmp/mobilenet_v1_1.0_224.tflite
--inference_type=FLOAT
--input_arrays=input
--output_arrays=final_result
--input_shapes=1,224,224,3
Here --input_shapes=1,224,224,3 indicates that the input picture is an RGB picture with a length and width of 224 pixels. Executing the above command generates the mobilenet_v1_1.0_224.tflite model file.
The embodiment provides a complete process of deep learning model training, so that the trained deep learning model meets the actual use requirement, and the accuracy of image and sound adjustment is improved.
A scene recognition module: furthermore, the AIService APP needs to be called before recognition. In addition, on a television chip platform, because the CPU computing power is insufficient, a GPU needs to be used to run the trained model in order to speed up the classification of picture scenes and avoid system stalling caused by an excessively high CPU load. The AI television chip platform integrates Android NN, an interface layer developed by Google to support deep learning operations; it provides deep learning operations such as convolution and fully connected layers, so that the AIService APP can call the GPU to perform the model computation. The steps are as follows:
Loading the dynamic libraries:
Compiling the Android-based TensorFlow Lite sample software generates the tensorflowlite.jar library together with the 32-bit and 64-bit libtensorflowlite_jni.so libraries. These are placed under the AIService source-code directory, respectively as AIService/libs/tensorflowlite.jar, AIService/libs/armeabi/libtensorflowlite_jni.so and AIService/libs/arm64-v8a/libtensorflowlite_jni.so; after the three dynamic libraries are loaded, the AIService APP is allowed to call the Android NN interface.
Loading the trained model and labels:
Rename the output_labels.txt file to scene_labels.txt and put it into the AIService/assets/ folder; rename the generated mobilenet_v1_1.0_224.tflite file to scene_mobilenet_v1_1.0_224.tflite and put it into the AIService/assets/ folder.
Enabling Android NN:
Set UseNNAPI(true) in AIService.
Since the .pb model was converted into a TensorFlow Lite model whose input picture is an RGB picture 224 pixels in length and width, and in order to improve the scene recognition accuracy, the parameters of the current video frame captured by AIService must be consistent with the parameters of the pictures used in the .pb model conversion. Therefore, the SurfaceControl.screenshot function is called in AIService to capture a picture of the currently playing content; the picture is scaled so that its data dimensions are 224 × 224 × 3, i.e. an RGB color picture whose length and width are both 224 pixels, consistent with the model's input resolution, and is fed to MobileNet-v1 as the input for picture classification.
The screenshot is scaled and converted into an imgData reference of the ByteBuffer class, the scene classification information of the screenshot is stored in labelProbArray, and the TensorFlow Lite interpreter is called to run the model. In AIService, the feature values of the captured picture of the currently playing content are compared with the positive and negative feature values of the various scene pictures used for training, and the scene with the highest similarity is output as the prediction result.
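The client described here runs on Android inside the AIService APP; as a language-neutral illustration, an equivalent inference flow is sketched below in Python with the TensorFlow Lite interpreter, using the model and label file names generated above. The helper preprocess_frame() is the hypothetical one sketched earlier.

```python
# Sketch of the equivalent inference flow using the TensorFlow Lite interpreter.
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="scene_mobilenet_v1_1.0_224.tflite")
interpreter.allocate_tensors()
input_index = interpreter.get_input_details()[0]["index"]
output_index = interpreter.get_output_details()[0]["index"]

with open("scene_labels.txt") as f:
    labels = [line.strip() for line in f]

def classify(frame):
    # frame: float32 array shaped (1, 224, 224, 3), e.g. from preprocess_frame()
    interpreter.set_tensor(input_index, frame)
    interpreter.invoke()
    probabilities = interpreter.get_tensor(output_index)[0]   # analogue of labelProbArray
    return labels[int(np.argmax(probabilities))]
```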
An image and sound adjustment module: further, in a third embodiment of the method for adaptively adjusting images and sounds in a scene, proposed based on the first embodiment of the present invention, the step S20 includes the steps of:
and sending the acquired scene recognition result to a video source terminal so that the video source terminal acquires the image parameters and the sound parameters of the current display picture and the adjusting parameters of the current display picture in a preset mapping relation table corresponding to the scene recognition result so as to adjust the image parameters and the sound parameters according to the adjusting parameters.
After the video source terminal receives the scene recognition result, it first identifies the input channel of the current source signal and obtains the resolution and frame rate of the source signal; it then obtains, from a preset mapping relation table, the image adjustment parameters corresponding to the input channel, resolution and frame rate, and adjusts the image signal output after image processing of the source signal based on these adjustment parameters, for example parameters for color, sharpness, brightness, saturation, contrast, backlight, the MEMC function, the local dimming algorithm, dynamic contrast, local contrast improvement, dynamic range extension, dynamic target reshaping, precise smoothing, MPEG noise reduction, blue-level extension, color temperature and the like. For example, when the current scene recognition result is a game, attention is paid to the fluency of the game: image post-processing can be turned off in the game image tuning to reduce output delay, and the MEMC function can be added to further meet players' needs. The specific classification of the parameters can be distinguished according to each solution or the chip's processing capability, and the parameters may include brightness, contrast, color saturation, sharpness, color temperature, hue, noise reduction, DDC and the like. After the parameters are selected, a suitable image processing mode is chosen, for example deciding whether the MEMC function needs to be enabled; if the game runs at 120 Hz and motion is fast, MEMC needs to be turned on. For another example, when the current scene recognition result is football, the motion-compensation MEMC function is turned on and the color, saturation and brightness of green in the picture are enhanced; when the current scene recognition result is a match, the MEMC function, color adjustment and sharpness adjustment can be applied; for a sky scene, the local dimming algorithm, colors and so on may be adjusted.
Further, different types of programs have different sound requirements; the program types include at least one of music, sports, movies, news or games. The sound adjustment parameters include parameters such as EQ, subwoofer, surround sound, dynamic range control, virtual 3D sound effect, dialogue enhancement, bass enhancement, clarity enhancement, Atmos sound effect, DTS sound effect and volume. For example, music programs require faithful music reproduction and a good dynamic response range, so the dynamic-compression threshold can be relaxed appropriately according to the current system volume to widen the dynamic range; sports programs need a better sense of scene and comfort, so a virtual surround-sound algorithm can be activated and intelligent volume control enabled, so that the atmosphere of the scene can be felt during cheering without sudden increases in loudness; movie programs need better scene rendering, and a cinema-level sense of presence can be achieved by using bass boosting, dialogue enhancement and a virtual surround-sound algorithm.
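Purely as an illustration of such a preset mapping relation table, the sketch below pairs a few scene recognition results with image and sound adjustment parameters; every value and the two apply_* hooks are invented placeholders, not figures from the disclosure.

```python
# Illustrative preset mapping relation table: the video source terminal looks
# up the received scene recognition result and applies the adjustment parameters.
ADJUSTMENT_TABLE = {
    "game":     {"image": {"memc": "on", "post_processing": "off", "low_latency": True},
                 "sound": {"eq": "flat", "dialogue_enhance": False}},
    "football": {"image": {"memc": "on", "green_saturation": 70, "brightness": 55},
                 "sound": {"surround": "virtual", "smart_volume": True}},
    "movie":    {"image": {"contrast": 65, "color_temperature": "warm"},
                 "sound": {"bass_boost": True, "dialogue_enhance": True, "surround": "virtual"}},
}

def apply_image_parameters(params):   # placeholder for the platform's picture-setting API
    print("image:", params)

def apply_sound_parameters(params):   # placeholder for the platform's audio-setting API
    print("sound:", params)

def adjust_for_scene(scene_label):
    params = ADJUSTMENT_TABLE.get(scene_label)
    if params is None:
        return  # unknown scene: keep the current image and sound parameters
    apply_image_parameters(params["image"])
    apply_sound_parameters(params["sound"])

# adjust_for_scene("football")
```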
Further, in a fourth embodiment of the method for adaptively adjusting images and sounds in a scene, proposed based on the first embodiment of the present invention, the step S20 further includes the steps of:
and sending the scene recognition result to audio output equipment so that the audio output equipment adjusts the sound parameters according to the scene recognition result.
The specific process may be sending the obtained scene recognition result to an audio output device, so that the audio output device obtains the current sound parameter and the adjustment parameter corresponding to the scene recognition result in a preset audio mapping relationship table corresponding to the scene recognition result, and adjusts the sound according to the adjustment parameter.
In this embodiment, when the television is further externally connected with an audio output device, such as an intelligent sound box, the scene recognition result may be sent to the audio output device, and personalized adjustment may be performed through the audio output device. The sound is adjusted according to the scene recognition result, so that good sound effect can be obtained in different playing scenes, and the user experience is improved.
Referring to fig. 4, the hardware structure of the system may include a communication module 10, a memory 20 and a processor 30. In the system, the processor 30 is connected to the memory 20 and the communication module 10 respectively; the memory 20 stores a computer program which, when executed by the processor 30, implements the steps of the above method embodiments.
The communication module 10 may be connected to an external communication device through a network. The communication module 10 may receive requests from the external communication device, and may also send requests, instructions and information to the external communication device, where the external communication device may be another television, a server, or an Internet-of-Things device.
The memory 20 may be used to store software programs as well as various data. The memory 20 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required for at least one function (for example, obtaining a current display picture and recognizing it through the trained deep learning model to obtain a scene recognition result matching the current display picture); the data storage area may include a database and may store data or information created according to the use of the system. Further, the memory 20 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
The processor 30, which is a control center of the system, connects various parts of the entire television using various interfaces and lines, and performs various functions of the system and processes data by operating or executing software programs and/or modules stored in the memory 20 and calling data stored in the memory 20, thereby monitoring the system as a whole. Processor 30 may include one or more processing units; alternatively, the processor 30 may integrate an application processor, which primarily handles operating systems, user interfaces, applications, etc., and a modem processor, which primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 30.
Although not shown in fig. 4, the system may further include a circuit control module for connecting to a power source to ensure proper operation of other components. Those skilled in the art will appreciate that the system architecture shown in FIG. 4 does not constitute a limitation on the system, and may include more or fewer components than shown, or some components in combination, or a different arrangement of components.
The invention also proposes a computer-readable storage medium on which a computer program is stored. The computer-readable storage medium may be the Memory 20 in the system of fig. 4, and may also be at least one of a ROM (Read-Only Memory)/RAM (Random Access Memory), a magnetic disk, and an optical disk, and the computer-readable storage medium includes instructions for enabling a terminal device (which may be a television, an automobile, a mobile phone, a computer, a server, a terminal, or a network device) having a processor to execute the method according to the embodiments of the present invention.
In the present invention, the terms "first", "second", "third", "fourth" and "fifth" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance, and those skilled in the art can understand the specific meanings of the above terms in the present invention according to specific situations.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Although the embodiment of the present invention has been shown and described, the scope of the present invention is not limited thereto, it should be understood that the above embodiment is illustrative and not to be construed as limiting the present invention, and that those skilled in the art can make changes, modifications and substitutions to the above embodiment within the scope of the present invention, and that these changes, modifications and substitutions should be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (9)

1. A method for adaptively adjusting images and sounds in a scene, the method comprising the steps of:
acquiring a current display picture, and identifying the current display picture through a trained deep learning model to obtain a scene identification result matched with the current display picture;
sending the scene recognition result to the video source terminal mapped by the current display picture so that the video source terminal adjusts image parameters and sound parameters according to the scene recognition result;
the step of sending the scene recognition result to the video source terminal mapped by the current display picture so that the video source terminal adjusts the image parameter and the sound parameter according to the scene recognition result comprises the following steps:
and sending the acquired scene recognition result to a video source terminal so that the video source terminal acquires the image parameters and the sound parameters of the current display picture and the adjusting parameters of the current display picture in a preset mapping relation table corresponding to the scene recognition result so as to adjust the image parameters and the sound parameters according to the adjusting parameters.
2. The method for adaptively adjusting images and sounds according to claim 1, wherein before the step of acquiring the current display picture, the method further comprises:
collecting sample images of different classes from a sample library;
and training the deep learning model according to the sample images of different categories to obtain the deep learning model after training.
3. The method for adaptively adjusting images and sounds according to claim 2, wherein the step of training the deep learning model according to the different types of sample images comprises:
acquiring corresponding scene label data according to the type of the sample image;
and taking the sample image as the input of the deep learning model, outputting predicted scene label data after the deep learning model is operated, and training the deep learning model according to the predicted scene label data and the scene label data corresponding to the sample image to obtain the trained deep learning model.
4. The method as claimed in claim 3, wherein the step of taking the sample image as an input of the deep learning model, outputting predicted scene label data after running the deep learning model, and training the deep learning model according to the predicted scene label data and the scene label data corresponding to the sample image to obtain a trained deep learning model comprises:
inputting a sample image into a deep learning model, so that the deep learning model outputs predicted scene label data, and adding 1 to the accumulated training times;
comparing scene label data corresponding to the sample image with the predicted scene label data to obtain a loss function;
adjusting parameters of a deep learning model according to the loss function so as to update the deep learning model;
judging whether the accumulated training times reach a preset training threshold value or not;
stopping training when the accumulated training times reach a preset training threshold value, and taking the deep learning model reaching the preset training threshold value as a deep learning model after training;
and when the accumulated training times do not reach a preset training threshold value, acquiring a new sample image and executing the step of inputting the sample image into the deep learning model.
5. The method of claim 1, wherein the step of recognizing the current display frame through the trained deep learning model to obtain the scene recognition result matching the current display frame comprises:
and inputting the current display picture into a trained deep learning model to obtain a corresponding scene recognition result.
6. The method for adaptively adjusting images and sounds according to any one of claims 1 to 5, wherein the step of sending the scene recognition result to the video source terminal mapped by the current display frame to make the video source terminal adjust the image parameters and the sound parameters according to the scene recognition result further comprises:
and sending the scene recognition result to audio output equipment so that the audio output equipment adjusts the sound parameters according to the scene recognition result.
7. The method for adaptively adjusting images and sounds according to claim 6, wherein the step of sending the scene recognition result to an audio output device so that the audio output device adjusts the sound parameters according to the scene recognition result comprises:
and sending the acquired scene recognition result to audio output equipment so that the audio output equipment acquires the current sound parameter and the adjusting parameter corresponding to the scene recognition result in the preset audio mapping relation table corresponding to the scene recognition result to adjust the sound parameter according to the adjusting parameter.
8. A system for adaptively adjusting images and sounds in a scene, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps of the method for adaptively adjusting images and sounds in a scene as claimed in any one of claims 1 to 7.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method for scene adaptive adjustment of images and sounds according to any one of claims 1 to 7.
CN201911424251.1A 2019-12-31 2019-12-31 Method and system for adaptively adjusting images and sounds in scene and readable storage medium Active CN111131889B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911424251.1A CN111131889B (en) 2019-12-31 2019-12-31 Method and system for adaptively adjusting images and sounds in scene and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911424251.1A CN111131889B (en) 2019-12-31 2019-12-31 Method and system for adaptively adjusting images and sounds in scene and readable storage medium

Publications (2)

Publication Number Publication Date
CN111131889A CN111131889A (en) 2020-05-08
CN111131889B true CN111131889B (en) 2022-11-25

Family

ID=70507159

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911424251.1A Active CN111131889B (en) 2019-12-31 2019-12-31 Method and system for adaptively adjusting images and sounds in scene and readable storage medium

Country Status (1)

Country Link
CN (1) CN111131889B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113766341A (en) * 2020-06-04 2021-12-07 深圳市万普拉斯科技有限公司 Television image quality adjusting method, device and system and television equipment
CN113965790A (en) * 2020-07-20 2022-01-21 Tcl科技集团股份有限公司 Image display processing method, storage medium and terminal equipment
CN112118472A (en) * 2020-09-04 2020-12-22 三星电子(中国)研发中心 Method and apparatus for playing multimedia
CN112214189B (en) * 2020-10-10 2023-10-31 青岛海信传媒网络技术有限公司 Image display method and display device
CN114363693B (en) * 2020-10-13 2023-05-12 华为技术有限公司 Image quality adjusting method and device
CN112235633A (en) * 2020-10-16 2021-01-15 深圳创维-Rgb电子有限公司 Output effect adjusting method, device, equipment and computer readable storage medium
CN112488962A (en) * 2020-12-17 2021-03-12 成都极米科技股份有限公司 Method, device, equipment and medium for adjusting picture color based on deep learning
CN112822538A (en) * 2020-12-31 2021-05-18 努比亚技术有限公司 Screen projection display method, screen projection device, terminal and storage medium
CN114302220B (en) * 2021-01-14 2023-04-14 海信视像科技股份有限公司 Display device, external device and play mode control method
CN113515594A (en) * 2021-04-28 2021-10-19 京东数字科技控股股份有限公司 Intention recognition method, intention recognition model training method, device and equipment
CN113420754A (en) * 2021-07-15 2021-09-21 智谋纪(深圳)科技有限公司 Intelligent light color control method and device, computer equipment and storage medium
JP7453948B2 (en) * 2021-09-27 2024-03-21 Tvs Regza株式会社 Broadcast receiving device
CN114979791A (en) * 2022-05-27 2022-08-30 海信视像科技股份有限公司 Display device and intelligent scene image quality parameter adjusting method
CN115243062B (en) * 2022-06-16 2024-06-07 科大讯飞股份有限公司 Scene display method and device, screen display device, electronic device and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9826149B2 (en) * 2015-03-27 2017-11-21 Intel Corporation Machine learning of real-time image capture parameters
CN105898364A (en) * 2016-05-26 2016-08-24 北京小米移动软件有限公司 Video playing processing method, device, terminal and system
CN108710847B (en) * 2018-05-15 2020-11-27 北京旷视科技有限公司 Scene recognition method and device and electronic equipment
CN109086742A (en) * 2018-08-27 2018-12-25 Oppo广东移动通信有限公司 scene recognition method, scene recognition device and mobile terminal

Also Published As

Publication number Publication date
CN111131889A (en)

Similar Documents

Publication Publication Date Title
CN111131889B (en) Method and system for adaptively adjusting images and sounds in scene and readable storage medium
US20210160556A1 (en) Method for enhancing resolution of streaming file
CN101548551B (en) Ambient lighting
CN110933490A (en) Automatic adjustment method for picture quality and tone quality, smart television and storage medium
EP3602401A1 (en) Digital image auto exposure adjustment
CN113518185B (en) Video conversion processing method and device, computer readable medium and electronic equipment
KR20190093722A (en) Electronic apparatus, method for controlling thereof, and computer program product thereof
US20220038621A1 (en) Device for automatically capturing photo or video about specific moment, and operation method thereof
US11928793B2 (en) Video quality assessment method and apparatus
CN110069974B (en) Highlight image processing method and device and electronic equipment
US11922607B2 (en) Electronic device for processing image and image processing method thereof
US11989868B2 (en) Video quality assessing method and apparatus
US20240205376A1 (en) Image processing method and apparatus, computer device, and storage medium
EP3905135A1 (en) Edge learning display device and method
KR102426089B1 (en) Electronic device and Method for generating summary image of electronic device
CN112969032A (en) Illumination pattern recognition method and device, computer equipment and storage medium
CN112118494B (en) Video data processing method and device and storage medium
CN115243073B (en) Video processing method, device, equipment and storage medium
CN114727029B (en) Video processing method and device, electronic equipment and storage medium
CN113792174B (en) Picture display method, device, terminal equipment and storage medium
US11388348B2 (en) Systems and methods for dynamic range compression in multi-frame processing
CN114419473B (en) Deep learning real-time target detection method based on embedded equipment
US10764578B2 (en) Bit rate optimization system and method
CN114630139A (en) Quality evaluation method of live video and related equipment thereof
CN116546304A (en) Parameter configuration method, device, equipment, storage medium and product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant