CN118260380A - Processing method and system for multimedia scene interaction data - Google Patents

Processing method and system for multimedia scene interaction data

Info

Publication number
CN118260380A
Authority
CN
China
Prior art keywords
data
interaction
emotion
interactive
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410310846.9A
Other languages
Chinese (zh)
Inventor
颜海鹰
颜思威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an Longteng Technology And Culture Co ltd
Original Assignee
Xi'an Longteng Technology And Culture Co ltd
Filing date
Publication date
Application filed by Xi'an Longteng Technology And Culture Co ltd filed Critical Xi'an Longteng Technology And Culture Co ltd
Publication of CN118260380A


Abstract

The present invention relates to the field of multimedia interaction technologies, and in particular to a method and a system for processing multimedia scene interaction data. The method comprises: applying an asynchronous data stream acquisition strategy to an intelligent terminal device and asynchronously acquiring interaction data from a user to obtain initial multimedia interaction data; asynchronously deconstructing the initial multimedia interaction data to obtain interaction environment tag data, user interaction state data and fusion emotion perception data; performing emotion pattern analysis on the fused emotion perception data to generate corrected emotion pattern data; and matching an intelligent personalized interaction strategy against the interaction environment tag data and the user interaction state data by means of the corrected emotion pattern data, then performing personalized pushing based on that strategy to obtain intelligent push scene data. The invention realizes personalized scene interaction through dynamic emotion perception.

Description

Processing method and system for multimedia scene interaction data
Technical Field
The present invention relates to the field of multimedia interaction technologies, and in particular, to a method and a system for processing multimedia scenario interaction data.
Background
The widespread adoption of the mobile internet and the Internet of Things allows people to access and consume multimedia content anytime and anywhere through mobile phones, tablet computers, intelligent wearable devices and other terminal equipment. This trend not only extends the collection and processing of multimedia scene interaction data to a wider range of application scenarios, but also results in a dramatic increase in user-generated multimedia content. Processing this massive content requires not only efficient algorithms that cope with large-scale data, but also the ability to provide personalized services based on user interactions. However, conventional methods for processing multimedia scene interaction data generally adopt a serial processing mode, which is inefficient for large-scale data and in particular cannot meet the processing-speed requirements of high-definition video, large audio files and similar workloads. Moreover, conventional methods are often limited to shallow feature extraction and lack deep understanding and modeling of the semantics behind the multimedia data, which limits high-level scene understanding and reasoning.
Disclosure of Invention
Based on the above, the present invention provides a method and a system for processing multimedia scene interaction data, so as to solve at least one of the above technical problems.
In order to achieve the above purpose, a method for processing multimedia scene interaction data includes the following steps:
Step S1: selecting an event-driven model of the intelligent terminal equipment to obtain event-driven model data; real-time response of the acquisition event is carried out according to the event-driven model data, so that an asynchronous data stream acquisition strategy is generated; the method comprises the steps that an asynchronous data stream acquisition strategy is applied to intelligent terminal equipment, asynchronous interaction data acquisition is conducted on a user based on preset initial interaction task data, and initial multimedia interaction data are obtained;
Step S2: Asynchronous deconstructing is carried out on the initial multimedia interaction data to respectively obtain interaction environment tag data, user interaction state data and fusion emotion perception data;
Step S3: extracting emotion characteristics according to the fused emotion perception data to respectively obtain text emotion type data and voice emotion data; carrying out emotion pattern analysis through text emotion type data and voice emotion data to generate corrected emotion pattern data;
Step S4: performing interactive environment recognition according to the interactive environment label data to generate interactive environment data; performing interactive psychological analysis on the user interaction state data through correcting the emotion pattern data to obtain user interaction psychological data; performing intelligent personalized interaction strategy matching on the user interaction psychological data through the interaction environment data to generate an intelligent personalized interaction strategy;
Step S5: carrying out context demand analysis according to the initial multimedia interaction data to obtain user interaction demand data; carrying out dynamic scenario demand reasoning according to the user interaction demand data to generate dynamic emotion demand data; and generating a multimedia pushing scene according to the dynamic emotion demand data, and performing personalized pushing according to the intelligent personalized interaction strategy, so as to obtain intelligent pushing scene data.
By selecting a suitable event-driven model, the invention can effectively handle the various events generated on the intelligent terminal device and improve the response speed and resource utilization of the system. Real-time event response and the generated asynchronous data stream acquisition strategy ensure that events are captured and processed promptly. Applying the acquisition strategy to the device enables effective management and collection of device data, which improves the scalability and flexibility of the system and allows it to adapt to different scenarios and requirements. By asynchronously deconstructing the initial multimedia interaction data, complex multimedia data can be effectively decomposed and processed, which helps to extract key information and features from the data. By extracting the interaction environment tag data, the behavior and preferences of the user in different environments can be understood, providing data support for applications such as personalized recommendation and environment-adaptive adjustment; this improves the system's perception of the user's environment and supports more personalized, context-aware services. Extracting the emotion characteristics in the fused emotion perception data allows the emotional content expressed by the user during interaction to be analyzed in depth. Obtaining both text emotion type data and voice emotion data gives a more complete picture of the user's emotional state, so that the system can better perceive the user's emotional changes and needs and improve the accuracy and quality of emotional interaction. By analyzing text emotion type data and voice emotion data, the user's emotion patterns and trends can be identified, revealing how the user's emotional expression varies across situations. Emotion pattern analysis helps the system better understand the user's emotional needs and attitudes, providing data support for emotion-driven interaction, which in turn helps the system personalize the user experience and improve interaction effects and user satisfaction. By recognizing the interaction environment tag data, the system can accurately determine the characteristics of the user's current environment, such as a workplace, a home environment or the outdoors, and adjust the intelligent interaction service accordingly; this helps provide services that better fit the user's actual situation and improves the pertinence and personalization of the interaction. By analyzing the corrected emotion pattern data together with the user interaction state data, the system can deeply understand the user's psychological state and emotional experience during interaction, including emotional preference and emotional stability, thereby better understanding the user's emotional needs and providing data support for personalized interaction.
By combining the interaction environment data with the user interaction psychology data, the system can intelligently match an interaction strategy suited to the current environment and the user's psychological state, realizing personalized interaction services and adjusting the service according to the user's emotion and environment characteristics. Based on the dynamic emotion demand data, the system can generate multimedia push scenes for different users in forms such as text, pictures and videos. Through the intelligent personalized interaction strategy, the system pushes content according to each user's personalized requirements and preferences, improving the relevance and attractiveness of the pushed content. In summary, the processing method of multimedia scene interaction data collects multimedia interaction data as asynchronous data streams through an event-driven model, takes the processing efficiency of the multiple data types within multimedia data into account, and asynchronously deconstructs the multimedia data to improve large-scale data processing efficiency. By intelligently recognizing the emotional state of the interacting user, a personalized interaction mode is customized according to the interaction environment and the reaction state of the user's interaction behavior, yielding an intelligent personalized interaction strategy. Further, an emotion demand reasoning model is built from the user's emotion index data, dynamic scene demand reasoning is performed with this model, and pushing is carried out according to the intelligent personalized interaction strategy, thereby realizing personalized dynamic scene interaction.
Preferably, step S1 comprises the steps of:
Step S11: defining an acquisition data stream of the intelligent terminal equipment to generate acquisition data stream data;
Step S12: performing relay role design according to the acquired data stream data to obtain relay role data; asynchronous logic hierarchical division is carried out on the collected data stream data through the relay role data, and asynchronous data stream data are generated;
Step S13: selecting an event-driven model according to the asynchronous data stream data to obtain event-driven model data;
Step S14: based on preset event triggering condition data, utilizing event-driven model data to conduct real-time response of acquisition events, and conducting multimedia data format standardization processing, so that an asynchronous data stream acquisition strategy is obtained;
Step S15: and applying an asynchronous data stream acquisition strategy to the intelligent terminal equipment, and carrying out asynchronous interaction data acquisition on the user based on preset initial interaction task data to obtain initial multimedia interaction data.
The invention defines the data flow of the intelligent terminal equipment, and ensures that various data can be accurately and efficiently acquired. By defining the data flow, the key information such as the data type, format, acquisition frequency and the like which need to be collected can be clearly defined. The collected data streams can be better organized and managed by relay role design and asynchronous logic hierarchy division. The relay role design can help determine the flow path and processing manner of the data stream in the whole system, thereby ensuring that the data can be transmitted and processed as expected. By selecting a proper event driven model, the asynchronous data stream can be better processed, and the real-time performance and efficiency of the system are improved. The event-driven model can effectively process various events in the data stream, and corresponding response and processing are performed according to the occurrence of the events. And carrying out real-time response of the acquisition event by utilizing event-driven model data based on preset event triggering condition data, and carrying out multimedia data format standardization processing. This helps to ensure that the collected data can be effectively transmitted and shared between different devices and platforms, and improves the availability and accessibility of the data. And applying the formulated asynchronous data stream acquisition strategy to the intelligent terminal equipment, and carrying out asynchronous interaction data acquisition on the user according to the preset initial interaction task data. This ensures that the data collected matches the actual interaction behavior and needs of the user.
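As an illustration of the event-driven, asynchronous acquisition strategy described above, the following Python sketch shows one possible shape of such a collector. The event names, the trigger set, and the JSON normalization schema are illustrative assumptions rather than the disclosed implementation.

```python
# Hypothetical sketch of an event-driven, asynchronous acquisition loop (step S1).
import asyncio
import json
import time

EVENT_TRIGGERS = {"touch", "voice_start", "camera_frame"}  # preset trigger conditions (assumed)

async def device_event_source(queue: asyncio.Queue):
    """Stand-in for the intelligent terminal device: emits interaction events asynchronously."""
    for name in ("touch", "voice_start", "camera_frame"):
        await asyncio.sleep(0.1)
        await queue.put({"event": name, "ts": time.time()})
    await queue.put(None)  # end of stream

def normalize(record: dict) -> str:
    """Format-standardization step: serialize every event to one JSON schema."""
    return json.dumps({"event": record["event"], "ts": round(record["ts"], 3)})

async def acquisition_strategy(queue: asyncio.Queue) -> list[str]:
    """Respond to trigger events in real time and collect normalized records."""
    collected = []
    while (record := await queue.get()) is not None:
        if record["event"] in EVENT_TRIGGERS:
            collected.append(normalize(record))
    return collected

async def main():
    queue = asyncio.Queue()
    producer = asyncio.create_task(device_event_source(queue))
    data = await acquisition_strategy(queue)
    await producer
    print(data)  # initial multimedia interaction data (metadata only in this sketch)

if __name__ == "__main__":
    asyncio.run(main())
```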
Preferably, step S2 comprises the steps of:
Step S21: dividing data types of the initial multimedia interaction data to respectively obtain initial interaction video data, initial interaction audio data, initial interaction text data and user interaction operation data;
Step S22: extracting interactive video features of the initial interactive video data to respectively obtain key frame data, interactive object expression feature data and interactive environment label data;
Step S23: performing user interaction state analysis on the user interaction operation data through the key frame data to generate user interaction state data;
Step S24: carrying out emotion vocabulary recognition on the initial interactive text data to generate emotion text vocabulary data;
Step S25: and performing audio frequency spectrum processing according to the initial interactive audio data to generate audio frequency spectrum data.
In the process of dividing the data types of the initial multimedia interactive data, the invention can accurately identify different types of data, including video, audio, text and user interactive operation data, and is beneficial to adopting different processing strategies for different types of data, thereby realizing the asynchronous deconstructing effect and improving the processing efficiency of mass data. The interactive video feature extraction is carried out on the initial interactive video data, so that key information in the video can be comprehensively and accurately extracted, wherein the key information comprises images at key moments, emotion expression of interactive objects, environment background where interaction occurs and the like. Such information is of great importance for understanding user behavior, emotional state, and interaction environment. The user interaction state analysis is carried out on the user interaction operation data through the key frame data, so that the interaction state of the user in the video interaction process can be accurately identified, wherein the interaction state comprises actions, behaviors and emotion changes of the user. Such analysis helps to understand the behavior patterns and emotional changes of the user during the interaction. Emotion vocabulary recognition is performed on the initial interactive text data, so that vocabularies with emotion colors, including positive, negative or neutral words, can be recognized from the text data. Such identification can help the system understand the emotional tendency expressed by the user during the interaction. The audio frequency spectrum processing is carried out according to the initial interactive audio data, so that the frequency spectrum information comprising the characteristics of frequency, intensity and the like of sound can be extracted from the audio data. Such processing aids the system in capturing important information in audio, such as speech content, emotion expressions, etc.
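A minimal sketch of the type division and audio spectrum processing of this step is given below, assuming the interaction payload arrives as a dictionary and using a plain FFT for the spectrum; the field names and the 16 kHz sample rate are assumptions.

```python
# Sketch of step S21 (data type division) and step S25 (audio spectrum processing).
import numpy as np

def split_by_type(payload: dict) -> dict:
    """Route the initial multimedia interaction data into the four streams of step S21."""
    return {
        "video": payload.get("video"),            # -> key frames, expressions, environment tags
        "audio": payload.get("audio"),            # -> spectrum / speech features
        "text": payload.get("text"),              # -> emotion vocabulary recognition
        "operations": payload.get("operations"),  # -> user interaction state analysis
    }

def audio_spectrum(samples: np.ndarray, sample_rate: int = 16000):
    """Magnitude spectrum of the interactive audio via an FFT."""
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    return freqs, spectrum

# Usage with one second of synthetic audio (a 440 Hz tone).
t = np.linspace(0, 1, 16000, endpoint=False)
freqs, spec = audio_spectrum(np.sin(2 * np.pi * 440 * t))
print(freqs[np.argmax(spec)])  # ~440.0
```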
Preferably, step S22 comprises the steps of:
Step S221: performing video frame sequence processing on the initial interactive video data through preset frame sampling rate data to generate initial video frame sequence data; constructing an initial picture structure according to the initial video frame sequence data to generate initial frame picture structure data;
Step S222: performing frame difference calculation according to the initial video frame sequence data to generate frame difference data;
Step S223: performing optical flow field calculation on the initial video frame sequence data through the frame differential data to obtain optical flow field data;
Step S224: according to the optical flow field data, carrying out feature separation on the moving object to respectively obtain expression feature data of the interactive object and interactive environment label data;
Step S225: setting the inter-frame association strength of the initial video frame sequence data through the optical flow field data to obtain inter-frame association strength data; carrying out graph edge weight distribution on the initial frame graph structure data through the inter-frame association strength data to obtain video frame graph structure data;
Step S226: and carrying out graph centrality analysis on the video frame graph structure data, extracting key frames, and generating key frame data.
The invention processes the video frame sequence of the initial interactive video data through the preset frame sampling rate data, ensures that the video data is processed according to a certain sampling rate, ensures the continuity and the integrity of the data, and simultaneously provides a reliable data base for subsequent analysis. For example, if the video data frame rate is too high, the amount of data can be reduced by adjusting the sampling rate, improving the processing efficiency. And carrying out frame difference calculation according to the initial video frame sequence data, extracting difference information between video frames, and helping to identify moving objects and scene changes in the video, so that more accurate feature extraction and analysis are carried out, and the calculated amount of subsequent processing can be reduced and the efficiency of an algorithm can be improved through the frame difference calculation. The optical flow field calculation is performed on the initial video frame sequence data through the frame differential data, so that motion information between video frames, including the speed and the direction of an object, is captured, and the dynamic change in the video is facilitated to be understood. The method comprises the steps of carrying out characteristic separation of a moving object according to optical flow field data, and effectively separating the moving object from the environment in a video, so that key characteristic information is extracted. Interactive object expression feature data may be used for emotion analysis and user behavior recognition, while interactive environment tag data may provide background information of the video capture environment. The method comprises the steps of setting the inter-frame association strength and distributing the graph edge weight of initial video frame sequence data through optical flow field data, wherein the method is used for establishing the association relationship between video frames, and a foundation is provided for the structural representation of video. The inter-frame association strength data and the video frame diagram structure data can be used for tasks such as video retrieval, content recommendation and correlation analysis, and the utilization efficiency and the application value of the video data are improved. The method performs the diagram centrality analysis on the video frame diagram structure data, extracts the key frames with the most representation and information content from the video data, is beneficial to summarizing and abstracting the video content, reduces the redundancy of the video data, and improves the compression and transmission efficiency of the data.
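The following sketch illustrates the key-frame pipeline described above (frame sampling, frame differencing, dense optical flow, an inter-frame association graph, and centrality-based selection), assuming OpenCV and networkx are available. The edge-weight formula and the sampling step are illustrative choices, not the patented procedure.

```python
# Sketch of steps S221-S226: sampling, frame differencing, dense optical flow,
# a weighted frame graph, and centrality-based key-frame extraction.
import cv2
import networkx as nx
import numpy as np

def extract_key_frames(video_path: str, sample_step: int = 5, top_k: int = 3):
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_step == 0:  # preset frame sampling rate (S221)
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
        idx += 1
    cap.release()

    graph = nx.Graph()
    graph.add_nodes_from(range(len(frames)))
    for i in range(len(frames) - 1):
        diff = cv2.absdiff(frames[i], frames[i + 1])            # frame difference (S222)
        flow = cv2.calcOpticalFlowFarneback(                    # optical flow field (S223)
            frames[i], frames[i + 1], None, 0.5, 3, 15, 3, 5, 1.2, 0)
        motion = np.linalg.norm(flow, axis=2).mean()
        # Inter-frame association strength: similar, low-motion neighbours link tightly (S225).
        weight = 1.0 / (1.0 + motion + diff.mean())
        graph.add_edge(i, i + 1, weight=weight)

    centrality = nx.pagerank(graph, weight="weight")            # graph centrality analysis (S226)
    key_ids = sorted(centrality, key=centrality.get, reverse=True)[:top_k]
    return [frames[i] for i in sorted(key_ids)]
```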
Preferably, step S224 includes the steps of:
Step S2241: moving object separation is carried out according to the optical flow field data, and moving object data and moving background data are respectively obtained;
Step S2242: object region refinement processing is carried out on the moving object data to generate object contour data;
Step S2243: performing face recognition according to the object contour data to obtain face recognition data;
Step S2244: Carrying out expression intensity feature analysis on the face recognition data to generate interactive object expression feature data;
Step S2245: performing background differential calculation on the motion background data to generate background environment differential data;
Step S2246: Performing light intensity analysis on the background environment difference data to generate environment light intensity data;
Step S2247: and carrying out color distribution detection according to the ambient light intensity data, and carrying out background object recognition so as to obtain interactive environment label data.
The invention separates moving objects through the optical flow field data, can effectively separate the moving objects from the static background in the video, is beneficial to identifying main moving objects in the video, extracts key information and reduces the complexity of subsequent processing. The object region refinement processing is performed on the moving object data, so that the region outline of the moving object can be refined, the boundary definition and accuracy of the object are improved, and the characteristic information of the object is extracted more accurately. Face recognition is carried out according to the object contour data, so that faces in videos can be accurately recognized, relevant information of the faces is extracted, and characteristics such as facial expression and emotion are further analyzed. The face recognition data is subjected to expression intensity and characteristic analysis, so that the intensity and the characteristic of the face expression can be accurately analyzed, and the emotion state of a user can be captured more accurately. The background difference calculation is carried out on the motion background data, so that the background information in the video can be accurately extracted, the motion object and the static background can be separated, and the complexity of subsequent processing is reduced. The light intensity analysis is performed on the background environment difference data, so that the light intensity change in the video background can be analyzed, which is helpful for understanding the light condition of the video shooting environment. The color distribution detection is carried out according to the ambient light intensity data, and the background object recognition is carried out, so that various objects in the video background can be accurately recognized, and the understanding and analysis of the video shooting scene are facilitated.
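A hedged OpenCV sketch of these sub-steps follows: background subtraction stands in for moving-object separation, a Haar cascade for face recognition, and a mean-brightness plus hue-histogram analysis for the ambient-light and color-distribution steps. The thresholds and the mapping to environment tags are assumptions.

```python
# Sketch of step S224's sub-steps, per video frame.
import cv2
import numpy as np

subtractor = cv2.createBackgroundSubtractorMOG2(history=200, varThreshold=25)
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def analyse_frame(frame_bgr: np.ndarray) -> dict:
    mask = subtractor.apply(frame_bgr)                       # moving object vs. background (S2241)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,  # object contour refinement (S2242)
                                   cv2.CHAIN_APPROX_SIMPLE)
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, 1.1, 4)      # face recognition (S2243)

    background = cv2.bitwise_and(gray, gray, mask=cv2.bitwise_not(mask))
    light_intensity = float(background.mean())               # ambient light analysis (S2246)
    hue_hist = cv2.calcHist([cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)],
                            [0], cv2.bitwise_not(mask), [16], [0, 180])
    return {
        "moving_objects": len(contours),
        "faces": len(faces),
        "light_intensity": light_intensity,                  # feeds the environment tag (S2247)
        "dominant_hue_bin": int(np.argmax(hue_hist)),        # color distribution summary
    }
```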
Preferably, step S23 comprises the steps of:
Step S231: performing time axis association on the user interaction operation data through the key frame data to generate associated interaction event data;
Step S232: user interaction hot spot detection is carried out according to the associated interaction event data, so that key interaction event data are obtained;
Step S233: classifying the interaction types of the key interaction event data to generate interaction classification data;
Step S234: performing interactive response time analysis on the key interactive event data through the interactive action classification data based on a preset standard interactive action response threshold value to generate interactive response time data;
Step S235: and marking the user interaction state in real time through the interaction reaction time data to generate user interaction state data.
The invention carries out time axis association on the user interaction operation data through the key frame data, and is characterized in that the interaction operation of the user can be associated with the key frame of the video, thereby determining the time point when the user operation happens and accurately understanding the operation behavior of the user in the video. User interaction hot spot detection is carried out according to the associated interaction event data, so that key interaction events in the video, namely high-frequency occurrence areas or key scenes of user interaction, can be identified, and attention focuses and interest points of users can be understood. The key interaction event data is subjected to interaction action type classification, namely interaction behaviors of the user can be classified, such as clicking, sliding, playing and the like, so that the interaction mode of the user is better understood, and the deep mining of the behavior characteristics of the user is facilitated. Based on a preset standard interaction action reaction threshold value, the interaction action classification data is used for carrying out interaction reaction time analysis on the key interaction event data, so that the reaction time of a user on a specific action can be determined. This helps to evaluate the speed of the user's response to different interaction events, and thus better understand the user's interaction behavior characteristics. The real-time interaction state of the user is marked through the interaction reaction time data, so that the interaction state of the user can be monitored and recorded in real time. This helps to discover the behavior changes and demand changes of the user in time, providing timely response and personalized services to the system.
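The timeline association and reaction-time labelling described above can be sketched as below; the 1.5 s threshold and the state names are illustrative assumptions.

```python
# Small sketch of steps S231-S235: align operations with key-frame timestamps,
# then label interaction state by reaction time against a preset threshold.
from dataclasses import dataclass

STANDARD_RESPONSE_S = 1.5  # preset standard interaction response threshold (assumed)

@dataclass
class Operation:
    action: str       # e.g. "click", "swipe", "play"
    timestamp: float  # seconds on the media timeline

def label_interaction_state(key_frame_ts: list[float], ops: list[Operation]) -> list[dict]:
    states = []
    for op in ops:
        # Time-axis association: nearest preceding key frame (S231).
        anchors = [t for t in key_frame_ts if t <= op.timestamp]
        if not anchors:
            continue
        reaction = op.timestamp - max(anchors)                # reaction time (S234)
        states.append({
            "action": op.action,                              # interaction type (S233)
            "reaction_s": round(reaction, 2),
            "state": "engaged" if reaction <= STANDARD_RESPONSE_S else "distracted",  # S235
        })
    return states

print(label_interaction_state([0.0, 4.0, 9.0],
                              [Operation("click", 4.6), Operation("swipe", 12.5)]))
# [{'action': 'click', 'reaction_s': 0.6, 'state': 'engaged'},
#  {'action': 'swipe', 'reaction_s': 3.5, 'state': 'distracted'}]
```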
Preferably, step S3 comprises the steps of:
Step S31: carrying out emotion word intensity calculation on emotion text vocabulary data to generate emotion word intensity data; carrying out emotion type recognition according to emotion word intensity data to generate text emotion type data;
Step S32: voice characteristic extraction is carried out on the audio frequency spectrum data through Mel-frequency cepstral coefficients (MFCCs), and voice characteristic data are generated;
Step S33: carrying out audio emotion recognition on voice characteristic data by using a preset support vector machine model so as to obtain voice emotion data;
Step S34: Carrying out emotion pattern analysis according to the text emotion type data and the voice emotion data to generate emotion pattern data;
Step S35: and carrying out emotion mode correction processing on the emotion mode data by using the expression characteristic data of the interactive object, and generating corrected emotion mode data.
According to the invention, emotion word intensity calculation is carried out on the emotion text vocabulary data, so that the intensity of the emotion words contained in a text can be quantified, reflecting the emotional tendency of the text. This facilitates a deep understanding of the emotional connotation of the text, and emotion type recognition based on the emotion word intensity data can sort the text into different emotion categories, such as positive, negative or neutral, further enhancing the understanding of the text's emotional tendency. Voice characteristic extraction is carried out on the audio frequency spectrum data through Mel-frequency cepstral coefficients (MFCCs), extracting characteristics of the audio data such as the spectral features, intonation and rhythm of the sound; this helps to convert the audio data into quantifiable feature vectors. Audio emotion recognition is carried out on the voice characteristic data by using a preset support vector machine model, which identifies emotion information contained in the audio, such as joy, anger and sadness, so that the emotional tendency in the audio can be accurately judged. Emotion pattern analysis is carried out according to the text emotion type data and the voice emotion data, comprehensively considering the emotion information in the text and the audio to form a comprehensive emotion pattern, which helps in understanding the emotional state of the user. Emotion pattern correction is carried out on the emotion pattern data by utilizing the expression feature data of the interactive object, so that the expression information of the interactive object is also taken into account; the emotion pattern is thereby adjusted more accurately, improving its accuracy and reliability.
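A minimal sketch of the MFCC extraction and SVM-based audio emotion recognition follows, assuming librosa and scikit-learn; the training data here is a random placeholder and the label set is assumed, since the patent presupposes a pre-trained support vector machine model.

```python
# Sketch of steps S32-S33: MFCC features and an SVM emotion classifier.
import librosa
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def mfcc_features(samples: np.ndarray, sr: int = 16000, n_mfcc: int = 13) -> np.ndarray:
    """Mel-frequency cepstral coefficients, averaged over time into one vector."""
    mfcc = librosa.feature.mfcc(y=samples, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

# Hypothetical training set: one MFCC vector per labelled utterance (placeholder data).
X_train = np.random.randn(40, 13)
y_train = np.random.choice(["happy", "angry", "sad", "neutral"], size=40)

emotion_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
emotion_svm.fit(X_train, y_train)

# Inference on a new utterance (synthetic noise standing in for real audio).
utterance = np.random.randn(16000).astype(np.float32)
speech_emotion = emotion_svm.predict([mfcc_features(utterance)])[0]
print(speech_emotion)
```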
Preferably, step S4 comprises the steps of:
Step S41: performing interactive environment recognition according to the interactive environment label data to generate interactive environment data;
Step S42: performing interactive psychological analysis on the user interaction state data through correcting the emotion pattern data to obtain user interaction psychological data;
Step S43: performing intelligent interaction strategy matching on the user interaction psychological data through the interaction environment data to generate an intelligent interaction strategy;
Step S44: acquiring historical user scenario interaction data;
Step S45: and performing personalized adjustment on the intelligent interaction strategy by using the historical user scenario interaction data to generate an intelligent personalized interaction strategy.
According to the invention, the interactive environment is identified according to the interactive environment label data, and the current interactive environment can be accurately determined, including information such as light conditions, background object types, positions and the like. The intelligent interaction strategy establishment and adjustment of the system are facilitated according to the characteristics of the environment, and user experience and interaction effect are improved. The interaction psychological analysis is carried out on the user interaction state data by correcting the emotion pattern data, so that the interaction psychological state of the user in a specific emotion pattern can be understood deeply. Through correction and analysis of the emotion pattern data, the emotion and psychological state of the user can be grasped more accurately. The intelligent interaction strategy matching is carried out on the user interaction psychological data through the interaction environment data, and personalized and intelligent interaction service can be provided for the user according to different characteristics of the environment. By combining the interaction environment data with the user interaction psychological data, the system can more accurately match with a proper interaction strategy, and user satisfaction and interaction effect are improved. Historical user scenario interaction data are acquired, a large amount of user interaction behavior data can be accumulated, and preferences and behavior patterns of users can be mined from the historical user scenario interaction data. The historical data can be used as an important basis for adjusting the intelligent interaction strategy, so that a system is helped to better understand the requirements and behavior habits of users, and the individuation and intelligent level of interaction is improved. The intelligent interaction strategy is subjected to personalized adjustment by utilizing the historical user scenario interaction data, and the intelligent interaction strategy can be subjected to fine adjustment according to the personalized requirements and preferences of the user. By analyzing the historical data, the system can better know the behavior mode and preference of each user, so that personalized interaction service which meets the requirements of the user is provided for the user, and the user satisfaction degree and interaction effect are improved.
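A simple rule-based sketch of the strategy matching and historical personalization is shown below; the rule table, field names, and history format are illustrative assumptions rather than the disclosed matching logic.

```python
# Sketch of steps S41-S45: match a strategy from environment and user psychology,
# then bias it with historical scenario interaction data.
BASE_STRATEGIES = {
    ("home", "relaxed"): {"modality": "voice", "tone": "casual", "content": "entertainment"},
    ("home", "stressed"): {"modality": "voice", "tone": "soothing", "content": "relaxation"},
    ("office", "focused"): {"modality": "text", "tone": "concise", "content": "productivity"},
    ("outdoor", "excited"): {"modality": "audio", "tone": "energetic", "content": "navigation"},
}

def match_strategy(environment: str, psychology: str, history: dict) -> dict:
    strategy = dict(BASE_STRATEGIES.get(
        (environment, psychology),
        {"modality": "text", "tone": "neutral", "content": "general"}))
    # Personalized adjustment (S44-S45): prefer the content category the user
    # engaged with most often in the historical scenario interaction data.
    if history:
        strategy["content"] = max(history, key=history.get)
    return strategy

print(match_strategy("home", "relaxed", {"entertainment": 12, "news": 3}))
# {'modality': 'voice', 'tone': 'casual', 'content': 'entertainment'}
```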
Preferably, step S5 comprises the steps of:
Step S51: carrying out comprehensive feature weighting processing on the corrected emotion pattern data, and carrying out emotion state index calculation to generate emotion state index data;
Step S52: carrying out context demand analysis according to the initial multimedia interaction data to obtain user interaction demand data;
Step S53: constructing an emotion demand reasoning model from the emotion state index data based on a convolutional neural network model;
Step S54: transmitting the user interaction demand data to the emotion demand reasoning model to conduct dynamic scenario demand reasoning, and generating dynamic emotion demand data;
Step S55: performing interactive recommendation content mining according to the dynamic emotion demand data to generate interactive recommendation content data;
Step S56: and generating a multimedia pushing scene for the interactive recommendation content data by using a generative adversarial network, and carrying out personalized pushing according to the intelligent personalized interaction strategy, so as to obtain intelligent pushing scene data.
The invention carries out comprehensive feature weighting processing and emotion state index calculation on the corrected emotion pattern data, so that multiple emotion features can be considered together and combined, with weights, into emotion state index data; such processing reflects the user's emotional state more accurately. Context demand analysis of the initial multimedia interaction data makes it possible to understand the user's interaction demands and preferences in depth; by analyzing the user's multimedia interaction behavior, the system can grasp the user's interests, favorites and habits. Building an emotion demand reasoning model from the emotion state index data on the basis of a convolutional neural network model yields a deep learning model that automatically infers the user's emotional demands by learning the characteristics of the emotion state index data. The model can mine the latent patterns of the emotion state index data and provides an efficient, accurate analysis tool for dynamic scene demand reasoning. Feeding the user interaction demand data into the emotion demand reasoning model for dynamic scenario demand reasoning generates dynamic emotion demand data in real time according to the user's current emotional state and demands, so that the data better reflect the user's actual needs and emotional changes. Mining interactive recommendation content according to the dynamic emotion demand data surfaces recommendation content that matches the user's emotional state, demands and interests; such content mining improves user satisfaction and engagement and enhances the perceived quality of the interactive experience. Generating multimedia push scenes for the interactive recommendation content data by utilizing a generative adversarial network (GAN), and pushing them according to the intelligent personalized interaction strategy, allows push scenes to be generated dynamically according to the user's personalized demands and emotional state, realizing a personalized push experience. Such intelligent push scenes improve user engagement and satisfaction and enhance the interaction between the user and the system.
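As one way such an emotion demand reasoning model could be structured, the following PyTorch sketch maps a window of emotion-state indices to a distribution over demand classes. The layer sizes, the 32-step window, and the four demand classes are assumptions; the generative-adversarial scene generation of step S56 is not reproduced here.

```python
# Sketch of steps S53-S54: a small convolutional network over emotion-state index sequences.
import torch
import torch.nn as nn

class EmotionDemandNet(nn.Module):
    def __init__(self, n_channels: int = 4, n_demands: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_channels, 16, kernel_size=3, padding=1),  # emotion-index channels
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(32, n_demands)  # e.g. relax / cheer-up / focus / social

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).squeeze(-1))

model = EmotionDemandNet()
# One batch: 4 emotion-state index channels observed over a 32-step window.
emotion_indices = torch.randn(1, 4, 32)
demand_logits = model(emotion_indices)
print(demand_logits.softmax(dim=-1))  # dynamic emotion demand distribution
```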
The invention also provides a processing system for multimedia scene interaction data, which executes the above processing method of multimedia scene interaction data and comprises the following modules:
The asynchronous data stream acquisition module is used for carrying out event-driven model selection on the intelligent terminal equipment to obtain event-driven model data; real-time response of the acquisition event is carried out according to the event-driven model data, so that an asynchronous data stream acquisition strategy is obtained; the method comprises the steps that an asynchronous data stream acquisition strategy is applied to intelligent terminal equipment, asynchronous interaction data acquisition is conducted on a user based on preset initial interaction task data, and initial multimedia interaction data are obtained;
The multimedia deconstructing module is used for carrying out asynchronous deconstructing processing on the initial multimedia interaction data to respectively obtain interaction environment tag data, user interaction state data and fusion emotion perception data;
The interactive emotion extraction module is used for extracting emotion characteristics according to the fused emotion perception data to respectively obtain text emotion type data and voice emotion data; carrying out emotion pattern analysis through text emotion type data and voice emotion data to generate corrected emotion pattern data;
The interaction strategy intelligent construction module is used for carrying out interaction environment identification according to the interaction environment label data to generate interaction environment data; performing interactive psychological analysis on the user interaction state data through correcting the emotion pattern data to obtain user interaction psychological data; performing intelligent personalized interaction strategy matching on the user interaction psychological data through the interaction environment data to generate an intelligent personalized interaction strategy;
The interactive scene generation module is used for carrying out context demand analysis according to the initial multimedia interactive data to obtain user interactive demand data; carrying out dynamic scenario demand reasoning according to the user interaction demand data to generate dynamic emotion demand data; and generating a multimedia pushing scene according to the dynamic emotion demand data, and performing personalized pushing according to the intelligent personalized interaction strategy, so as to obtain intelligent pushing scene data.
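A skeleton wiring of the five claimed modules into one pipeline might look as follows; the class names mirror the module descriptions above and the method bodies are placeholders, not the disclosed implementation.

```python
# Placeholder skeleton of the claimed system, one class per module.
class AsyncDataStreamAcquisition:
    def collect(self, device_id: str) -> dict:
        return {"video": b"", "audio": b"", "text": "", "operations": [], "device": device_id}

class MultimediaDeconstruction:
    def deconstruct(self, raw: dict) -> tuple[dict, dict, dict]:
        # -> environment tags, user interaction state, fused emotion perception
        return {"scene": "home"}, {"state": "engaged"}, {"text": "neutral", "voice": "happy"}

class InteractiveEmotionExtraction:
    def corrected_emotion_pattern(self, fused: dict) -> str:
        return fused.get("voice", "neutral")

class InteractionStrategyBuilder:
    def build(self, env: dict, state: dict, pattern: str) -> dict:
        return {"modality": "voice", "tone": "casual", "emotion": pattern, "scene": env["scene"]}

class InteractiveSceneGenerator:
    def push(self, raw: dict, strategy: dict) -> dict:
        return {"pushed_to": raw["device"], "strategy": strategy}

def run_pipeline(device_id: str) -> dict:
    raw = AsyncDataStreamAcquisition().collect(device_id)
    env, state, fused = MultimediaDeconstruction().deconstruct(raw)
    pattern = InteractiveEmotionExtraction().corrected_emotion_pattern(fused)
    strategy = InteractionStrategyBuilder().build(env, state, pattern)
    return InteractiveSceneGenerator().push(raw, strategy)

print(run_pipeline("tablet-01"))
```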
Drawings
Fig. 1 is a flow chart of the steps of a method for processing multimedia scene interaction data according to the present invention;
Fig. 2 is a detailed flowchart of the implementation of step S1 in Fig. 1;
Fig. 3 is a detailed flowchart of the implementation of step S3 in Fig. 1.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
The following is a clear and complete description of the technical method of the present patent in conjunction with the accompanying drawings, and it is evident that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, are intended to fall within the scope of the present invention.
Furthermore, the drawings are merely schematic illustrations of the present invention and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. The functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor methods and/or microcontroller methods.
It will be understood that, although the terms "first," "second," etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another element. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of example embodiments. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
In order to achieve the above objective, referring to fig. 1 to 3, the present invention provides a method for processing multimedia scene interaction data, comprising the following steps:
Step S1: selecting an event-driven model of the intelligent terminal equipment to obtain event-driven model data; real-time response of the acquisition event is carried out according to the event-driven model data, so that an asynchronous data stream acquisition strategy is generated; the method comprises the steps that an asynchronous data stream acquisition strategy is applied to intelligent terminal equipment, asynchronous interaction data acquisition is conducted on a user based on preset initial interaction task data, and initial multimedia interaction data are obtained;
Step S2: Asynchronous deconstructing is carried out on the initial multimedia interaction data to respectively obtain interaction environment tag data, user interaction state data and fusion emotion perception data;
Step S3: extracting emotion characteristics according to the fused emotion perception data to respectively obtain text emotion type data and voice emotion data; carrying out emotion pattern analysis through text emotion type data and voice emotion data to generate corrected emotion pattern data;
Step S4: performing interactive environment recognition according to the interactive environment label data to generate interactive environment data; performing interactive psychological analysis on the user interaction state data through correcting the emotion pattern data to obtain user interaction psychological data; performing intelligent personalized interaction strategy matching on the user interaction psychological data through the interaction environment data to generate an intelligent personalized interaction strategy;
Step S5: carrying out context demand analysis according to the initial multimedia interaction data to obtain user interaction demand data; carrying out dynamic scenario demand reasoning according to the user interaction demand data to generate dynamic emotion demand data; and generating a multimedia pushing scene according to the dynamic emotion demand data, and performing personalized pushing according to the intelligent personalized interaction strategy, so as to obtain intelligent pushing scene data.
In the embodiment of the present invention, as described with reference to fig. 1, a step flow diagram of a method for processing multimedia scene interaction data according to the present invention is provided, and in the embodiment, the method for processing multimedia scene interaction data includes the following steps:
Step S1: selecting an event-driven model of the intelligent terminal equipment to obtain event-driven model data; real-time response of the acquisition event is carried out according to the event-driven model data, so that an asynchronous data stream acquisition strategy is generated; the method comprises the steps that an asynchronous data stream acquisition strategy is applied to intelligent terminal equipment, asynchronous interaction data acquisition is conducted on a user based on preset initial interaction task data, and initial multimedia interaction data are obtained;
In the embodiment of the invention, a corresponding data acquisition scheme is formulated according to the requirement of acquiring the data stream, including determining the acquisition frequency, mode and data format. The information and types in the data stream are analyzed, and functions and tasks are designed, including data conversion, filtering, and aggregation. A relay program or service is developed and deployed into the system. And selecting a proper event driving model, and defining a preset event triggering condition. And establishing connection with the event driven model, monitoring the data flow in real time, immediately responding once the triggering condition occurs, and starting a data acquisition program. And carrying out format standardization treatment on the acquired data so as to ensure consistency. And determining an initial interaction task, configuring an asynchronous interaction data acquisition program, and triggering data acquisition when a user executes the task to obtain initial multimedia interaction data.
Step S2: asynchronous deconstructing is carried out on the initial multimedia interaction data to respectively obtain interaction environment tag data, user interaction state data and fusion emotion perception data;
In the embodiment of the invention, asynchronous deconstructing processing is carried out on the initial multimedia interactive data so as to extract key information. First, interactive video data is decoded and key frames extracted to obtain important still pictures in the video. In the key frames, a deep learning model or a feature extraction algorithm is utilized to detect the human face and identify the expression features, so that the emotion state of the user is obtained. Meanwhile, an image processing technology or a deep learning model is utilized to analyze the video background, and interaction environment label data such as scene types, light conditions and the like are extracted. Next, user interaction state analysis, such as recognizing clicking, sliding, etc., is performed in combination with the user interaction operation data and the key frame data to infer the behavior and intention of the user. And then, carrying out emotion vocabulary recognition on the interactive text data by using a natural language processing technology so as to determine emotion tendencies in the text. Finally, the interactive audio data is subjected to spectrum analysis, spectrum characteristics are extracted, and for example, the frequency spectrum information of the audio is obtained by utilizing Fourier transformation. The whole process aims to asynchronously process different types of multimedia data to obtain interactive environment tag data, user interactive state data and fused emotion perception data.
Step S3: extracting emotion characteristics according to the fused emotion perception data to respectively obtain text emotion type data and voice emotion data; carrying out emotion pattern analysis through text emotion type data and voice emotion data to generate corrected emotion pattern data;
In the embodiment of the invention, emotion intensity calculation is carried out by using emotion text vocabulary data, emotion polarity and degree are determined, emotion type identification is carried out on the text, and the text is judged to be positive, negative or neutral. The audio data is preprocessed, including pre-emphasis and framing, to calculate mel-frequency cepstral coefficients (MFCCs). And classifying and identifying the voice emotion data by using a trained Support Vector Machine (SVM) model. And matching the emotion text type data with the voice emotion data to ensure that each voice sample has a corresponding emotion label. And calculating the frequency distribution of the voice emotion corresponding to different emotion text types, and comparing the difference and the similarity. And carrying out pattern matching according to a preset emotion pattern data rule base to obtain optimally matched emotion pattern data. And matching the expression characteristic data of the interactive object with the emotion pattern data to ensure that each emotion pattern corresponds to the expression characteristic data of the interactive object. And analyzing the relation between the expression characteristics of the interactive object and the emotion modes, and correcting the emotion modes to reflect the actual emotion states of the interactive object.
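The pattern matching and expression-based correction described here can be sketched as a weighted vote over the text, speech, and facial-expression cues; the weights and label set below are illustrative assumptions.

```python
# Sketch of steps S34-S35: fuse text, speech, and expression cues into a corrected emotion pattern.
from collections import Counter

def fuse_emotion_pattern(text_emotion: str, speech_emotion: str,
                         expression_emotion: str, expression_intensity: float) -> str:
    """Return the corrected emotion pattern; a strong facial expression dominates."""
    votes = Counter()
    votes[text_emotion] += 1.0
    votes[speech_emotion] += 1.0
    # Expression-based correction: weight grows with the measured expression intensity.
    votes[expression_emotion] += 1.0 + 2.0 * expression_intensity
    return votes.most_common(1)[0][0]

# Text reads neutral, voice reads happy, and a strong smile corrects the pattern to "happy".
print(fuse_emotion_pattern("neutral", "happy", "happy", expression_intensity=0.8))  # happy
```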
Step S4: performing interactive environment recognition according to the interactive environment label data to generate interactive environment data; performing interactive psychological analysis on the user interaction state data through correcting the emotion pattern data to obtain user interaction psychological data; performing intelligent personalized interaction strategy matching on the user interaction psychological data through the interaction environment data to generate an intelligent personalized interaction strategy;
In the embodiment of the invention, the interactive environment label data is used for environment recognition so as to generate the interactive environment data. For example, assume the tag data indicates that the current environment is a "home environment," described as "bright," containing a sofa and a television. Interactive psychological analysis is then carried out on the user interaction state data by means of the corrected emotion pattern data; this involves in-depth analysis of the user's emotional state and interaction behavior to infer the user's mental state. For example, if a user relies on voice interaction in a bright home environment, the emotion pattern data may be adjusted according to the environment and the user's actual behavior, for instance by shifting the emotion tendency toward greater affinity; psychological analysis is then performed on these data to understand the user's emotion and attitude. Next, the intelligent personalized interaction strategy is matched according to the interaction environment data and the user interaction psychology data. This means that, depending on the environment and the emotional state of the user, appropriate interaction means and content are selected to provide a better user experience. For example, in a bright home environment where users tend to prefer warmer interaction, the intelligent interactive system may choose a warmer, more personable voice interaction and provide home-related content, such as home theater playback or suggestions for family activities.
Step S5: carrying out context demand analysis according to the initial multimedia interaction data to obtain user interaction demand data; carrying out dynamic scenario demand reasoning according to the user interaction demand data to generate dynamic emotion demand data; and generating a multimedia pushing scene according to the dynamic emotion demand data, and performing personalized pushing according to the intelligent personalized interaction strategy, so as to obtain intelligent pushing scene data.
In the embodiment of the invention, the initial multimedia interaction data is analyzed to learn the requirements and expectations of users in different environments. User interaction demand data, covering interaction content, interaction mode and experience requirements, is extracted from the data; for example, a user may prefer to control playback with voice commands while viewing. Dynamic emotion demands are then inferred from the user interaction demand data. An emotion demand reasoning model is built with a convolutional neural network or similar model and trained on labeled emotion state index data. The reasoning model infers dynamic emotion demands from the user's emotional state and interaction demand data, for example that the user prefers happy and relaxed viewing content. A multimedia push scene is generated based on the dynamic emotion demand data, using techniques such as a generative adversarial network (GAN) to produce multimedia content scenes that match the user's emotional state and interaction environment. For example, a light-hearted comedy movie is recommended according to the user's emotional state, and a home environment scene suitable for viewing is generated. The generated push content is then pushed in a personalized manner according to the intelligent personalized interaction strategy, enhancing the user's viewing experience.
Preferably, step S1 comprises the steps of:
Step S11: defining an acquisition data stream of the intelligent terminal equipment to generate acquisition data stream data;
Step S12: performing relay role design according to the acquired data stream data to obtain relay role data; asynchronous logic hierarchical division is carried out on the collected data stream data through the relay role data, and asynchronous data stream data are generated;
step S13: selecting an event-driven model according to the asynchronous data stream data to obtain event-driven model data;
Step S14: based on preset event triggering condition data, utilizing event-driven model data to conduct real-time response of acquisition events, and conducting multimedia data format standardization processing, so that an asynchronous data stream acquisition strategy is obtained;
Step S15: and applying an asynchronous data stream acquisition strategy to the intelligent terminal equipment, and carrying out asynchronous interaction data acquisition on the user based on preset initial interaction task data to obtain initial multimedia interaction data.
The invention defines the data flow of the intelligent terminal equipment, and ensures that various data can be accurately and efficiently acquired. By defining the data flow, the key information such as the data type, format, acquisition frequency and the like which need to be collected can be clearly defined. The collected data streams can be better organized and managed by relay role design and asynchronous logic hierarchy division. The relay role design can help determine the flow path and processing manner of the data stream in the whole system, thereby ensuring that the data can be transmitted and processed as expected. By selecting a proper event driven model, the asynchronous data stream can be better processed, and the real-time performance and efficiency of the system are improved. The event-driven model can effectively process various events in the data stream, and corresponding response and processing are performed according to the occurrence of the events. And carrying out real-time response of the acquisition event by utilizing event-driven model data based on preset event triggering condition data, and carrying out multimedia data format standardization processing. This helps to ensure that the collected data can be effectively transmitted and shared between different devices and platforms, and improves the availability and accessibility of the data. And applying the formulated asynchronous data stream acquisition strategy to the intelligent terminal equipment, and carrying out asynchronous interaction data acquisition on the user according to the preset initial interaction task data. This ensures that the data collected matches the actual interaction behavior and needs of the user.
As an example of the present invention, referring to fig. 2, a detailed implementation step flow diagram of step S1 in fig. 1 is shown, where step S1 includes:
Step S11: defining an acquisition data stream of the intelligent terminal equipment to generate acquisition data stream data;
In the embodiment of the invention, a corresponding data acquisition scheme is formulated according to the acquired data flow requirement. This may include determining the frequency of acquisition, the manner of acquisition (e.g., real-time acquisition or timed acquisition), the format of the acquired data, and so forth. For example, for a smart phone, information such as the position, acceleration, etc. of the device can be collected in real time through a sensor.
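For illustration only, the acquisition data stream definition of step S11 could be organized in software as a small stream specification; the following minimal Python sketch shows one possible shape, where all field names, stream names and sampling values are assumptions rather than part of the disclosure.

```python
from dataclasses import dataclass

@dataclass
class AcquisitionStreamSpec:
    """Illustrative definition of one acquisition data stream (step S11)."""
    name: str                 # e.g. "location", "microphone"
    data_type: str            # "sensor", "audio", "video", "text"
    mode: str                 # "realtime" or "timed"
    frequency_hz: float       # sampling / polling frequency
    output_format: str        # format the raw data is normalized to

# Hypothetical streams for a smartphone terminal; the values are assumptions.
ACQUISITION_STREAMS = [
    AcquisitionStreamSpec("location", "sensor", "realtime", 1.0, "json"),
    AcquisitionStreamSpec("acceleration", "sensor", "realtime", 50.0, "json"),
    AcquisitionStreamSpec("microphone", "audio", "timed", 16000.0, "wav"),
]
```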
Step S12: performing relay role design according to the acquired data stream data to obtain relay role data; asynchronous logic hierarchical division is carried out on the collected data stream data through the relay role data, and asynchronous data stream data are generated;
In the embodiment of the invention, various information and data types contained in the acquired data stream data are analyzed according to the characteristics and the structure of the acquired data stream data. For example, for a smart home device, the data stream may include data from various sensors, such as temperature, humidity, illumination, and the like. Corresponding functions and tasks are designed according to the requirements of the relay roles, and the operations such as conversion, filtering, aggregation and the like of the data streams and interaction with other components can be involved. The relay role is implemented according to the designed functions and tasks. This may involve developing a corresponding relay program or service and deploying it into the system. For example, a background service program is developed to periodically acquire and process data from a data stream and then send the processing results to a downstream system.
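The relay role described above can be pictured as a small asynchronous service sitting between the acquisition source and downstream processing. The sketch below assumes Python's asyncio queues as the transport and uses a trivial filtering and conversion rule; it is a minimal illustration of such a relay, not the patented design.

```python
import asyncio

async def relay_role(source: asyncio.Queue, sink: asyncio.Queue) -> None:
    """Illustrative relay role: filter, convert and forward acquisition items."""
    while True:
        item = await source.get()
        if item is None:                       # sentinel: upstream finished
            await sink.put(None)
            break
        if item.get("value") is None:          # simple filtering rule (assumption)
            continue
        item["value"] = float(item["value"])   # format conversion
        await sink.put(item)

async def main() -> None:
    source, sink = asyncio.Queue(), asyncio.Queue()
    relay = asyncio.create_task(relay_role(source, sink))
    await source.put({"sensor": "temperature", "value": "21.5"})
    await source.put(None)
    await relay
    print(await sink.get())   # {'sensor': 'temperature', 'value': 21.5}

asyncio.run(main())
```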
Step S13: selecting an event-driven model according to the asynchronous data stream data to obtain event-driven model data;
In the embodiment of the invention, a suitable event-driven model is selected according to the characteristics of the asynchronous data stream data and the system requirements. The choice may be weighed against factors such as the real-time nature and complexity of the data stream. For example, if the system needs to process large-scale real-time data streams with high accuracy requirements, a stream-processing framework such as Apache Flink may be selected.
Step S14: based on preset event triggering condition data, utilizing event-driven model data to conduct real-time response of acquisition events, and conducting multimedia data format standardization processing, so that an asynchronous data stream acquisition strategy is obtained;
In the embodiment of the invention, a series of preset event triggering condition data needs to be defined so as to adapt to the monitoring requirement of the multimedia interaction data. These condition data may include events triggered by specific user operations, system state changes, or specific environmental conditions. For example, in a social media application, an event trigger condition may be that a user has posted a new post. And establishing connection with the event-driven model according to the preset event triggering condition data, and monitoring the multimedia interaction data stream in real time. This involves configuring the data source and setting the event listener to capture the trigger conditions in real time. For example, on a social media platform, a listener is set up to capture events of a user posting a new post. The event-driven model responds immediately to the occurrence of the trigger condition once it has been detected. This may include initiating a corresponding data acquisition program or service to capture multimedia interaction data associated with the event. For example, once a user has posted a new post, the system immediately initiates a data collection program to obtain the text, picture, or video content of the post. And carrying out format standardization processing on the acquired multimedia interaction data so as to ensure the consistency and usability of the data. This may include uniformly converting the data in different formats to a uniform format for subsequent processing and analysis. For example, text, pictures, and video content published by users are converted into a unified JSON format.
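A minimal sketch of an event listener with preset trigger conditions and JSON format normalization is given below; the condition names, the record layout and the use of Python's json module are assumptions made purely for illustration.

```python
import json
from datetime import datetime, timezone

# Hypothetical trigger conditions: event name -> predicate on the raw event.
TRIGGER_CONDITIONS = {
    "new_post": lambda e: e.get("action") == "publish",
}

def normalize(raw_event: dict) -> str:
    """Convert heterogeneous multimedia items into one unified JSON record."""
    record = {
        "event": raw_event.get("action"),
        "user": raw_event.get("user"),
        "media": raw_event.get("media", []),   # text / image / video parts
        "collected_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(record, ensure_ascii=False)

def on_event(raw_event: dict) -> None:
    """Listener: respond in real time only when a trigger condition matches."""
    for name, condition in TRIGGER_CONDITIONS.items():
        if condition(raw_event):
            print(name, "->", normalize(raw_event))

on_event({"action": "publish", "user": "u42", "media": ["photo.jpg"]})
```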
Step S15: and applying an asynchronous data stream acquisition strategy to the intelligent terminal equipment, and carrying out asynchronous interaction data acquisition on the user based on preset initial interaction task data to obtain initial multimedia interaction data.
In the embodiment of the invention, an initial interaction task, namely a preset user interaction behavior or task is determined and used for triggering data acquisition. For example, an initial interaction task may be a user logging into an application or performing a particular operation. And configuring an asynchronous interaction data acquisition program or service according to the initial interaction task data so as to trigger data acquisition when a user executes a task. For example, in a social media application, a data collection program is configured to trigger data collection when a user posts a new post or comment. When a user executes a preset initial interaction task, the data acquisition program starts to acquire corresponding multimedia interaction data according to an asynchronous data stream acquisition strategy. For example, when a user posts a new post in a social media application, the data collection program begins capturing data such as post content, pictures, or videos.
Preferably, step S2 comprises the steps of:
Step S21: dividing data types of the initial multimedia interaction data to respectively obtain initial interaction video data, initial interaction audio data, initial interaction text data and user interaction operation data;
Step S22: extracting interactive video features of the initial interactive video data to respectively obtain key frame data, interactive object expression feature data and interactive environment label data;
Step S23: performing user interaction state analysis on the user interaction operation data through the key frame data to generate user interaction state data;
Step S24: carrying out emotion vocabulary recognition on the initial interactive text data to generate emotion text vocabulary data;
step S25: and performing audio frequency spectrum processing according to the initial interactive audio data to generate audio frequency spectrum data.
In the embodiment of the invention, the initial multimedia interaction data is divided into different types of data according to different characteristics and formats of the data. For example, video data is identified as interactive video data, audio data is identified as interactive audio data, text content is identified as interactive text data, and user operation behavior is identified as user interactive operation data. The initial interactive video data is preprocessed, including video decoding, frame extraction, etc., for subsequent feature extraction. For example, the video uploaded by the user is decoded and key frames therein are extracted. And carrying out face detection and expression recognition on the interactive objects in the key frames, and extracting expression characteristic data of the interactive objects. A deep learning model or feature extraction based algorithm may be used to identify faces and analyze expressions. For example, a face appearing in a video is recognized, and expression features such as smile, anger, and the like are extracted to understand the emotional state of the user. And analyzing background information in the interactive video data, and extracting interactive environment label data such as scene types, light conditions and the like. Video background may be analyzed and identified using image processing techniques or deep learning models. For example, scene types (indoor, outdoor), light conditions (bright, dim), etc. in the video are identified to help understand the environmental context of the video. And the key frame data is used for carrying out analysis on the user interaction operation data in combination with the user interaction operation data, such as mouse clicking, touch screen and the like. For example, user interactions, such as clicks, swipes, etc., on a particular key frame are identified. And generating user interaction state data according to the analyzed user interaction operation data and by combining the time stamp of the key frame and other related information. For example, for a social media application, user interaction operations when browsing photos include click viewing, praise, comment, etc., and interaction state data of the user, such as browsing, praise, comment, etc., is generated by analyzing these operations and matching with content corresponding to a key frame. And carrying out emotion vocabulary recognition on the initial interactive text data by using a natural language processing technology so as to determine emotion tendencies contained in the text. For example, words in text are emotionally classified using an emotion dictionary or machine learning model, and positive, negative, or neutral emotion is identified therein. The initial interactive audio data is subjected to spectral analysis using digital signal processing techniques to extract spectral features of the audio. For example, the audio signal is converted into a frequency domain representation using fourier transformation, and spectral information of the audio is acquired.
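As one concrete illustration of the audio spectrum processing mentioned for step S25, the following Python sketch computes a one-sided amplitude spectrum with NumPy's FFT; the Hann windowing choice and the synthetic test tone are assumptions, not part of the disclosure.

```python
import numpy as np

def audio_spectrum(signal: np.ndarray, sample_rate: int):
    """Frequency-domain representation of an interactive audio clip (step S25)."""
    window = signal * np.hanning(len(signal))        # reduce spectral leakage
    magnitude = np.abs(np.fft.rfft(window))          # one-sided amplitude spectrum
    freqs = np.fft.rfftfreq(len(window), d=1.0 / sample_rate)
    return freqs, magnitude

# Synthetic 440 Hz tone as a stand-in for real interactive audio data.
sr = 16_000
t = np.arange(sr) / sr
freqs, mag = audio_spectrum(np.sin(2 * np.pi * 440 * t), sr)
print(f"dominant frequency ~ {freqs[np.argmax(mag)]:.1f} Hz")   # about 440 Hz
```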
Preferably, step S22 comprises the steps of:
Step S221: performing video frame sequence processing on the initial interactive video data through preset frame sampling rate data to generate initial video frame sequence data; constructing an initial picture structure according to the initial video frame sequence data to generate initial frame picture structure data;
Step S222: performing frame difference calculation according to the initial video frame sequence data to generate frame difference data;
step S223: performing optical flow field calculation on the initial video frame sequence data through the frame differential data to obtain optical flow field data;
Step S224: according to the optical flow field data, carrying out feature separation on the moving object to respectively obtain expression feature data of the interactive object and interactive environment label data;
step S225: setting the inter-frame association strength of the initial video frame sequence data through the optical flow field data to obtain inter-frame association strength data; carrying out graph edge weight distribution on the initial frame graph structure data through the inter-frame association strength data to obtain video frame graph structure data;
Step S226: and carrying out graph centrality analysis on the video frame graph structure data, extracting key frames, and generating key frame data.
In the embodiment of the invention, the frequency of extracting video frames from the initial interactive video data is determined according to the preset frame sampling rate data. For example, assume that the preset frame sampling rate is 10 frames per second. Video frame sequence data is extracted from the initial interactive video data according to the determined frame sampling rate. For example, if the frame rate of the video is 30fps, sampling is performed in such a manner that one video frame is extracted every 3 frames, resulting in video frame sequence data. And constructing an initial frame diagram structure according to the video frame sequence data. In the frame map structure, each video frame is represented as a node in the map, and edge connection exists between adjacent video frames. For example, for a video sequence comprising 10 video frames, a graph comprising 10 nodes is constructed, the nodes are connected in time sequence, i.e. with an edge between frame 1 and frame 2, an edge between frame 2 and frame 3, and so on. Frame differential computation is performed on the initial video frame sequence data to capture the variation between adjacent video frames. The frame difference calculation may employ various image processing techniques such as a mean square error method, an absolute difference method, and the like. And saving the difference data obtained by the frame difference calculation as frame difference data. These data may be used for subsequent tasks such as motion detection, dynamic object tracking, etc. For example, for a video sequence, moving objects or regions of change occurring in the video may be identified based on data calculated from frame differences. And calculating the motion condition of each pixel point in the video sequence by using a light flow method by utilizing the frame differential data. Optical flow is a method of describing the motion of pixels between adjacent frames, and the optical flow field is obtained by calculating the motion vector of each pixel on the image. For example, optical flow field computation is performed on a sequence of video frames using optical flow algorithms such as the Lucas-Kanade method or the Horn-Schunck method. The optical flow field data is used to analyze moving objects in the video, such as characters, objects, etc., and extract their features. For example, by tracking pixel displacement in an optical flow field, a face, a gesture or other interactive object in a video can be identified, and their expression, action and other features can be extracted. Background information in the optical flow field data is analyzed, and interactive environment tag data, such as scene type, light conditions, and the like, is extracted. The optical flow field data is used to calculate the motion correlation strength, i.e., inter-frame correlation strength, between adjacent video frames. For example, by analyzing pixel displacement in the optical flow field, motion consistency or disparity between adjacent frames is calculated to determine inter-frame correlation strength. And according to the inter-frame association strength data, assigning weights to edges in the initial frame graph structure so as to reflect the association degree between video frames. And carrying out centrality analysis on the video frame graph structure data to determine key nodes in the graph. 
For example, indexes such as degree centrality, betweenness centrality and closeness centrality of the nodes are calculated to identify the important nodes in the graph. Representative key frames are then extracted from the video frame sequence based on the results of the graph centrality analysis. For example, a node with higher degree centrality in the graph is selected as a key frame, or the key frame is selected according to the betweenness centrality of the node, so as to keep the most informative and representative frame images in the video.
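The chain from frame sampling through frame differencing, optical flow and graph centrality can be sketched roughly as below, assuming OpenCV for the image operations and NetworkX for the frame graph; the sampling rate, the edge-weight formula and the use of weighted degree as the centrality index are illustrative choices, not the patented method itself.

```python
import cv2
import numpy as np
import networkx as nx

def extract_keyframes(path: str, sample_every: int = 3, top_k: int = 5) -> list:
    """Rough sketch of steps S221-S226: sample frames, weight a frame graph by
    frame-difference / optical-flow motion, and keep the most central frames."""
    cap = cv2.VideoCapture(path)
    frames, idx, ok = [], 0, True
    while ok:
        ok, frame = cap.read()
        if ok and idx % sample_every == 0:                   # preset sampling rate
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
        idx += 1
    cap.release()

    graph = nx.Graph()
    graph.add_nodes_from(range(len(frames)))
    for i in range(len(frames) - 1):
        diff = cv2.absdiff(frames[i], frames[i + 1]).mean()  # frame difference
        flow = cv2.calcOpticalFlowFarneback(frames[i], frames[i + 1], None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        motion = np.linalg.norm(flow, axis=2).mean()         # optical-flow energy
        # Illustrative rule: strong motion implies weaker inter-frame association.
        graph.add_edge(i, i + 1, weight=1.0 / (1.0 + motion + diff))

    strength = dict(graph.degree(weight="weight"))   # weighted degree as centrality
    return sorted(strength, key=strength.get, reverse=True)[:top_k]
```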
Preferably, step S224 includes the steps of:
Step S2241: moving object separation is carried out according to the optical flow field data, and moving object data and moving background data are respectively obtained;
Step S2242: object region refinement processing is carried out on the moving object data to generate object contour data;
step S2243: performing face recognition according to the object contour data to obtain face recognition data;
Step S2244: carrying out expression intensity feature analysis on the face recognition data to generate interactive object expression feature data;
step S2245: performing background differential calculation on the motion background data to generate background environment differential data;
Step S2246: performing light intensity analysis on the background environment difference data to generate environment light intensity data;
step S2247: and carrying out color distribution detection according to the ambient light intensity data, and carrying out background object recognition so as to obtain interactive environment label data.
In the embodiment of the invention, for the optical flow field data of each frame, pixel points in an image are divided into two types according to the motion condition of the pixels: pixels belonging to a moving object and pixels belonging to a stationary background. According to the pixel motion vector in the optical flow field data, pixels are separated by using a threshold value, a clustering method and the like, and the pixels of the moving object are distinguished from the pixels of the static background. For example, for moving objects (e.g., people or vehicles) in a video, the optical flow field data shows that their pixels change over time, while the stationary background remains unchanged. For moving object data, an area composed of adjacent pixels is identified as an area of a moving object by using an image processing technique (such as connected domain analysis). And carrying out edge detection or contour extraction on the region of the moving object to obtain the accurate boundary of the object. For example, a Canny edge detection algorithm or a contour detection algorithm is used for refining the region of the moving object to obtain the contour of the object. Using the object contour data, regions that may contain faces are identified in the moving object. And carrying out face recognition on the identified area possibly containing the face, and determining whether the area possibly containing the face is contained by comparing the known face model or the feature. And for the identified face area, performing expression recognition by utilizing an image processing or deep learning technology, and analyzing the intensity and type of the facial expression. For example, using an expression recognition model based on deep learning, expression categories of a face (such as happy, angry, surprise, etc.) are recognized. And carrying out intensity analysis on the identified expression, and determining the expression degree or intensity of the expression. For example, facial expressions are scored through a deep learning model to obtain the intensity or confidence of the expression. And analyzing the light intensity conditions of different areas in the image, including brightness and color changes, by using the background environment difference data. For example, by performing pixel-level brightness calculation on the differential image, the change in light intensity in the image can be determined. For example, the image is segmented into different regions and the average luminance value for each region is calculated as part of the ambient light intensity data. Based on the ambient light intensity data, the color distribution of different areas in the image is analyzed to identify possible background objects. For example, color distribution characteristics of different areas in an image are detected according to a method such as color histogram or cluster analysis. And identifying a background object in the image according to the result of the color distribution detection, and marking. The background object type (e.g., wall, floor, furniture, etc.) in the image and its location information in the image are marked.
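A rough sketch of the moving-object versus background separation and the simple environment labelling is shown below, again assuming OpenCV; the motion and brightness thresholds are arbitrary assumptions, and the face recognition and expression intensity steps (S2243 to S2244) are omitted for brevity.

```python
import cv2
import numpy as np

def separate_and_label(prev_gray: np.ndarray, curr_gray: np.ndarray) -> dict:
    """Sketch of steps S2241-S2247: split moving pixels from background via the
    optical-flow magnitude, then derive simple environment labels."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.linalg.norm(flow, axis=2)
    moving_mask = (magnitude > 1.0).astype(np.uint8) * 255   # threshold is an assumption

    # Object region refinement: contours of the moving-pixel mask.
    contours, _ = cv2.findContours(moving_mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)

    # Background light intensity: mean brightness of the non-moving pixels.
    background = curr_gray[moving_mask == 0]
    brightness = float(background.mean()) if background.size else 0.0
    light_label = "bright" if brightness > 128 else "dim"    # illustrative cut-off

    return {
        "moving_objects": len(contours),
        "ambient_brightness": brightness,
        "environment_label": light_label,
    }
```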
Preferably, step S23 comprises the steps of:
step S231: performing time axis association on the user interaction operation data through the key frame data to generate associated interaction event data;
Step S232: user interaction hot spot detection is carried out according to the associated interaction event data, so that key interaction event data are obtained;
Step S233: classifying the interaction types of the key interaction event data to generate interaction classification data;
Step S234: performing interactive response time analysis on the key interactive event data through the interactive action classification data based on a preset standard interactive action response threshold value to generate interactive response time data;
Step S235: and marking the user interaction state in real time through the interaction reaction time data to generate user interaction state data.
In the embodiment of the invention, the key frame data and the user interaction operation data are associated in a time axis, and the key frame corresponding to each interaction event is determined. For example, for video browsing behavior in a social media application, operations such as praise, comment and the like of a user are associated with key frames of a video, and a time point when each operation occurs and a corresponding video frame are determined. And generating associated interaction event data according to the key frames and the associated interaction operation data, wherein the associated interaction event data comprises information such as time of each interaction event, corresponding video frame images and the like. For example, if the user clicks the like button on a certain video frame, the like operation is associated with the time of the video frame, and corresponding interactivity event data is generated. The associated interaction event data is analyzed to identify hot spot areas or events that have a higher frequency of user interaction or a greater impact. For example, in a social media application, if the interaction frequency of praise, comment, share, etc. of a video segment is high, the video segment may be marked as a user interaction hotspot. For a video, the interaction operations like praise, comment and the like in the hot spot area are identified as key interaction events, and the occurrence time and specific content of the interaction events are recorded. Operations of users in the social media application are classified according to functionality, such as praise operations, comment operations, sharing operations and the like. And classifying the key interaction event data according to the defined interaction action type standard, and classifying each event into a corresponding interaction action type. A standard reaction time threshold is set for each interaction, i.e. the maximum limit of the user's reaction time to a certain interaction is assumed. For example, assume that the standard reaction time threshold for setpoint praise is 5 seconds. The actual reaction time of each interaction is analyzed and compared with a set standard reaction time threshold to determine whether the user has reacted within a specified time. For example, for a praise operation in a certain video, the time from when the user clicks the praise button to when the actual praise operation is completed is recorded, and compared with a preset reaction time threshold to determine whether the user has completed the praise operation within a prescribed time. And judging the interaction state of the user according to the interaction reaction time data obtained through analysis, for example, judging whether the user is in an active state or an inactive state. For example, if the user completes the praise operation within a prescribed time, the user is marked as active; otherwise, the user is marked as inactive.
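The reaction-time comparison of steps S234 and S235 reduces to a threshold check per interaction type; the following sketch uses hypothetical thresholds and field names purely for illustration.

```python
# Hypothetical standard reaction-time thresholds per interaction type (seconds).
REACTION_THRESHOLDS = {"like": 5.0, "comment": 30.0, "share": 10.0}

def mark_interaction_state(events: list) -> list:
    """Compare measured reaction times against preset thresholds and mark the
    user as active or inactive for each associated interaction event."""
    marked = []
    for event in events:
        threshold = REACTION_THRESHOLDS.get(event["action"], 15.0)
        state = "active" if event["reaction_time_s"] <= threshold else "inactive"
        marked.append({**event, "state": state})
    return marked

print(mark_interaction_state([
    {"action": "like", "reaction_time_s": 2.3, "keyframe": 42},
    {"action": "comment", "reaction_time_s": 48.0, "keyframe": 57},
]))
```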
Preferably, step S3 comprises the steps of:
Step S31: carrying out emotion word intensity calculation on emotion text vocabulary data to generate emotion word intensity data; carrying out emotion type recognition according to emotion word intensity data to generate text emotion type data;
step S32: voice characteristic extraction is carried out on the audio frequency spectrum data through the Mel frequency cepstrum coefficient, and voice characteristic data are generated;
Step S33: carrying out audio emotion recognition on voice characteristic data by using a preset support vector machine model so as to obtain voice emotion data;
Step S34: carrying out emotion pattern analysis according to emotion text type data and voice emotion data to generate emotion pattern data;
Step S35: and carrying out emotion mode correction processing on the emotion mode data by using the expression characteristic data of the interactive object, and generating corrected emotion mode data.
According to the invention, emotion word intensity calculation on the emotion text vocabulary data quantifies the intensity of the emotion words contained in a text and reflects the emotional tendency of the text. This facilitates a deep understanding of the emotional coloring of the text, and emotion type recognition based on the emotion word intensity data can separate texts into different emotion categories, such as positive, negative or neutral, further enhancing the understanding of the text's emotional tendency. Voice feature extraction on the audio spectrum data through mel-frequency cepstral coefficients captures features of the audio data such as spectral characteristics, intonation and rhythm, which helps convert the audio data into quantifiable feature vectors. Audio emotion recognition on the voice feature data with a preset support vector machine model identifies emotion information contained in the audio, such as happiness, anger or sadness, so that the emotional tendency of the audio can be judged accurately. Emotion pattern analysis based on the text emotion type data and the voice emotion data comprehensively considers the emotion information in both text and audio to form an integrated emotion pattern, which helps in understanding the user's emotional state. Finally, emotion pattern correction processing on the emotion pattern data using the interactive object expression feature data takes the expression information of the interactive object into account, so that the emotion pattern is adjusted more accurately and its accuracy and reliability are improved.
As an example of the present invention, referring to fig. 3, a detailed implementation step flow diagram of step S3 in fig. 1 is shown, where step S3 includes:
Step S31: carrying out emotion word intensity calculation on emotion text vocabulary data to generate emotion word intensity data; carrying out emotion type recognition according to emotion word intensity data to generate text emotion type data;
In the embodiment of the invention, emotion intensity calculation is carried out on each word in emotion text vocabulary data. The emotion word strength may be determined based on the emotion polarity and degree of the vocabulary, and is typically calculated using an emotion dictionary or machine learning algorithm. For example, for the word "like", the emotion strength of the word "like" may be determined to be positive and high according to an emotion dictionary or emotion analysis algorithm. And identifying the emotion type of each text according to the emotion word intensity data. Emotion types generally include positive, negative, neutral, etc. For example, for a text containing positive emotion words such as "like", "happy" and the like, the emotion type can be judged as positive; and judging that the emotion types of the texts containing negative emotion vocabularies such as aversion, pain and the like are negative. And carrying out emotion intensity calculation on each word in the comment. For example, "weather" may be considered a neutral word, while "pleasurable" and "nice" are considered positive emotion words, with a higher intensity.
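A lexicon-based version of the emotion word intensity calculation and emotion type recognition could look like the sketch below; the toy lexicon and the 0.2 cut-offs are assumptions, not values taken from the disclosure.

```python
# Toy emotion lexicon: word -> signed intensity (illustrative values only).
EMOTION_LEXICON = {"like": 0.8, "happy": 0.9, "nice": 0.6,
                   "dislike": -0.7, "pain": -0.8}

def text_emotion(words: list):
    """Sketch of step S31: sum per-word intensities, then map to an emotion type."""
    score = sum(EMOTION_LEXICON.get(w.lower(), 0.0) for w in words)
    if score > 0.2:
        label = "positive"
    elif score < -0.2:
        label = "negative"
    else:
        label = "neutral"
    return score, label

print(text_emotion(["The", "weather", "is", "nice", "and", "I", "am", "happy"]))
# (1.5, 'positive')
```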
Step S32: voice characteristic extraction is carried out on the audio frequency spectrum data through the Mel frequency cepstrum coefficient, and voice characteristic data are generated;
In the embodiment of the invention, the audio data is preprocessed, for example with pre-emphasis and framing, and the audio signal is divided into a plurality of time windows. A Fast Fourier Transform (FFT) is applied to each time window to convert the time-domain signal into a frequency-domain signal. The mel-frequency cepstral coefficients (MFCCs) of each time window are then calculated: a mel filter bank is applied to the spectrum, the logarithm is taken, and finally a discrete cosine transform (DCT) is performed. For example, the pre-processed audio data is framed, typically at 20-30 milliseconds per frame; the audio signal may be divided into 20-millisecond time windows with some overlap between adjacent windows. An FFT is applied to each time window, the mel filter bank and log-DCT steps yield the MFCC coefficients of that window, and typically 13 MFCC coefficients are selected as speech features. The MFCC feature vectors of all time windows are finally combined to form a complete voice feature data sequence.
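Assuming the librosa library as the signal-processing backend (the disclosure names no specific library), step S32 can be approximated as follows; the 20 ms window and 13 coefficients follow the description, while the 10 ms hop and the default pre-emphasis coefficient are assumptions.

```python
import numpy as np
import librosa  # assumed tooling; not named in the disclosure

def mfcc_features(path: str) -> np.ndarray:
    """Roughly step S32: 13 MFCCs per ~20 ms window at 16 kHz."""
    y, sr = librosa.load(path, sr=16_000)
    y = librosa.effects.preemphasis(y)      # pre-emphasis, default coefficient 0.97
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=512, win_length=320, hop_length=160)
    return mfcc.T                           # shape: (number_of_windows, 13)
```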
Step S33: carrying out audio emotion recognition on voice characteristic data by using a preset support vector machine model so as to obtain voice emotion data;
In the embodiment of the invention, training is required according to the existing voice emotion data set to construct a Support Vector Machine (SVM) model. This data set should contain speech samples that have been labeled with emotion categories, such as "happy", "sad", "angry", and so forth. The feature engineering method is used to extract the voice features in the training set and pre-process the features, such as normalization or standardization. Then, a support vector machine algorithm is used for model training, and a model capable of accurately classifying emotion is fitted through training data. And inputting the voice characteristic data to be recognized into a pre-trained support vector machine model, and performing emotion classification recognition. The support vector machine model classifies the input speech feature data into different emotion categories, such as "happy", "sad", "angry", etc.
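A minimal training and classification sketch with scikit-learn's SVC is given below; the random placeholder features and labels merely stand in for a labelled speech-emotion dataset, and the RBF kernel and normalization step are assumed choices.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data standing in for a labelled speech-emotion training set:
# one 13-dimensional (e.g. mean-MFCC) feature vector per utterance.
rng = np.random.default_rng(0)
X_train = rng.random((60, 13))
y_train = rng.choice(["happy", "sad", "angry"], size=60)

model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))  # normalization + SVM
model.fit(X_train, y_train)

X_new = rng.random((1, 13))        # features of the utterance to be recognized
print(model.predict(X_new))        # e.g. ['happy']
```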
Step S34: carrying out emotion pattern analysis according to emotion text type data and voice emotion data to generate emotion pattern data;
In the embodiment of the invention, the emotion text type data and the voice emotion data are corresponding and matched, so that each voice sample is ensured to have a corresponding emotion label. And calculating the frequency distribution of the voice emotion corresponding to different emotion text types, and comparing the difference and the similarity between the emotion types. The distribution situation of the voice emotion under each emotion text type is counted, for example, the voice emotion in positive comments is mostly positive, and the voice emotion in negative comments is mostly negative. And then carrying out pattern matching in a preset emotion pattern data rule base to obtain the best matched emotion pattern data.
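The matching of text emotion types with speech emotions and the look-up in an emotion-pattern rule base can be sketched as follows; the sample pairs and the rule-base contents are invented for illustration.

```python
from collections import Counter

# Paired samples: (text emotion type, recognized speech emotion); invented data.
pairs = [("positive", "happy"), ("positive", "happy"), ("positive", "sad"),
         ("negative", "angry"), ("negative", "sad")]

# Frequency distribution of speech emotions under each text emotion type.
distribution = {}
for text_label, speech_label in pairs:
    distribution.setdefault(text_label, Counter())[speech_label] += 1

# Hypothetical emotion-pattern rule base: dominant combination -> pattern name.
RULE_BASE = {("positive", "happy"): "cheerful",
             ("negative", "sad"): "low-spirited",
             ("negative", "angry"): "irritated"}

for text_label, counter in distribution.items():
    dominant = counter.most_common(1)[0][0]
    print(text_label, dict(counter), "->",
          RULE_BASE.get((text_label, dominant), "mixed"))
```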
Step S35: and carrying out emotion mode correction processing on the emotion mode data by using the expression characteristic data of the interactive object, and generating corrected emotion mode data.
In the embodiment of the invention, the expression characteristic data of the interactive object is matched and correlated with the emotion pattern data, so that each emotion pattern corresponds to the expression characteristic data of the interactive object. And analyzing the relation between the expression characteristics of the interactive object and the emotion modes, and correcting the emotion modes according to the actual situation so as to reflect the actual emotion state of the interactive object. A series of rules or conditions are defined, and emotion patterns are corrected according to expression characteristic data of the interactive object. For example, if the smile level of the interaction subject is high, the corresponding emotion pattern may be adjusted to be a more positive emotion. For example, if a comment is marked negative, but the corresponding interactive object expression data shows that the object is highly happy, the emotion pattern of the comment may be modified to be more positive emotion.
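A rule-style correction of the emotion pattern from the interactive object's expression features might look like the following sketch; the 0.7 intensity thresholds and the pattern names are assumptions.

```python
def correct_emotion_pattern(pattern: str, expression: dict) -> str:
    """Adjust the matched emotion pattern when the interactive object's facial
    expression contradicts it (step S35); thresholds are illustrative."""
    smile = expression.get("smile_intensity", 0.0)   # assumed range 0.0 - 1.0
    anger = expression.get("anger_intensity", 0.0)
    if pattern in ("low-spirited", "negative") and smile > 0.7:
        return "positive"    # a clearly smiling face overrides a negative text label
    if pattern in ("cheerful", "positive") and anger > 0.7:
        return "negative"
    return pattern

print(correct_emotion_pattern("negative", {"smile_intensity": 0.85}))   # positive
```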
Preferably, step S4 comprises the steps of:
Step S41: performing interactive environment recognition according to the interactive environment label data to generate interactive environment data;
Step S42: performing interactive psychological analysis on the user interaction state data through correcting the emotion pattern data to obtain user interaction psychological data;
step S43: performing intelligent interaction strategy matching on the user interaction psychological data through the interaction environment data to generate an intelligent interaction strategy;
step S44: acquiring historical user scenario interaction data;
step S45: and performing personalized adjustment on the intelligent interaction strategy by using the historical user scenario interaction data to generate an intelligent personalized interaction strategy.
In the embodiment of the invention, the light intensity data is analyzed and classified according to a set threshold, for example to judge whether the light is sufficiently bright or too dark. For example, a light intensity greater than 5000 lumens may be judged as a bright environment and one below 2000 lumens as a dim environment. The image data is processed with computer vision techniques to identify the type and position of background objects in the image. Background objects such as walls, floors and furniture are detected using a target detection algorithm such as YOLO or Fast R-CNN; for example, a wall is detected on the left side of the image and a table in its center. The emotion pattern data is then adjusted according to the light conditions of the environment and the background object information, correcting the estimated emotional state of the user. The user's emotional state, interaction actions and interaction environment information are considered together to analyze the user's interactive psychological state; for example, users may prefer higher-affinity, more comfortable interactions in a warm home environment. Suitable interaction modes and content are selected in view of the user's emotional state and the environmental characteristics; for example, in a warm home environment a warm and intimate voice interaction mode is selected. Based on the interaction environment data, an intelligent interaction strategy is generated using a preset algorithm or model; the strategy can include different interaction modes and contents for different light conditions and background object types. Assuming the interactive environment data shows that the current environment is dimly lit and that a sofa and a television are in the background, the intelligent interaction strategy may, based on this information, remind the user by voice to turn on the light or provide a brighter interaction interface to improve the user experience. Historical user scenario interaction data, including interaction behavior and emotion feedback, is collected and organized, and is used to personalize the previously generated intelligent interaction strategy. The user's preferences and habits are analyzed from the historical data, and the interaction strategy is adjusted to improve user satisfaction and interaction effect. For example, the historical user scenario interaction data may reveal that some users prefer voice interaction in darker environments and touch interaction in bright environments; the intelligent interaction strategies of these users are therefore personalized to better meet their preferences.
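Putting the light classification and strategy matching together, a rule-table sketch is shown below; the rule table, the default strategy and the personalization hook are hypothetical, while the 5000 and 2000 thresholds follow the example above.

```python
# Hypothetical rule table mapping (light label, salient background object) to a strategy.
STRATEGY_RULES = {
    ("dim", "sofa"): "remind the user by voice to turn on the light; use a brighter UI",
    ("bright", "sofa"): "offer warm voice interaction and home-theatre suggestions",
}

def classify_light(lumens: float) -> str:
    """Thresholds follow the 5000 / 2000 example in the description."""
    if lumens > 5000:
        return "bright"
    if lumens < 2000:
        return "dim"
    return "normal"

def match_strategy(lumens: float, objects: list, preferred_mode: str = "") -> str:
    light = classify_light(lumens)
    strategy = next((STRATEGY_RULES[(light, o)] for o in objects
                     if (light, o) in STRATEGY_RULES), "default interaction")
    # Personalized adjustment from historical user scenario interaction data.
    if preferred_mode == "voice" and light == "dim":
        strategy += "; favour voice interaction"
    return strategy

print(match_strategy(1500, ["sofa", "television"], preferred_mode="voice"))
```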
Preferably, step S5 comprises the steps of:
step S51: carrying out comprehensive feature weighting processing on the corrected emotion mode data, and carrying out emotion state index calculation to generate emotion state index data;
step S52: carrying out context demand analysis according to the initial multimedia interaction data to obtain user interaction demand data;
step S53: constructing an emotion demand reasoning model through emotion state index data based on the convolutional neural network model;
step S54: transmitting the user interaction demand data to a emotion demand reasoning model to conduct dynamic scenario demand reasoning, and generating dynamic emotion demand data;
Step S55: performing interactive recommendation content mining according to the dynamic emotion demand data to generate interactive recommendation content data;
Step S56: and generating a multimedia pushing scene for the interactive recommendation content data by using the generated countermeasure network, and carrying out personalized pushing according to the intelligent personalized interaction strategy, so as to obtain intelligent pushing scene data.
In the embodiment of the invention, the corrected emotion pattern data is subjected to comprehensive feature weighting, combining different emotion features with weights so as to comprehensively reflect the emotional state of the user. For example, the emotion pattern data is weighted and summed according to the weights of positive, negative and neutral emotion to obtain a comprehensive emotion index, and from these comprehensive indexes an emotion state index is calculated by a specific algorithm, for example an emotion intensity score of 0.8 with positive polarity. Context demand analysis is performed on the initial multimedia interaction data to learn the user's requirements and expectations in different environments, and user interaction demand data is obtained from the analysis results, covering the user's requirements for interaction content, interaction mode and interaction experience. For example, the context demand analysis may reveal that the user prefers to control playback with voice commands while watching a movie, and this demand information is included in the user interaction demand data. Emotion state index data is modeled with a convolutional neural network (CNN) or similar model to construct an emotion demand reasoning model, which is trained on labeled emotion state index data so as to learn the mapping between emotional states and user interaction demands. Based on the input user interaction demand data, the emotion demand reasoning model uses the learned patterns and relations to infer dynamic scenario demands; for example, from the user's emotional state and interaction demand data it may infer that the user prefers happy, relaxed movie content while watching a movie. Interactive recommendation content is then mined according to the generated dynamic emotion demand data, for example movies, music or other entertainment content suitable for the user, combining the user's emotional needs with the current interaction environment. Multimedia push scene generation is performed on the interactive recommendation content data using techniques such as a generative adversarial network (GAN); for example, a GAN model is used to generate multimedia content scenes that match the user's current emotional state and interaction environment. The generated push content is pushed in a personalized manner according to the intelligent personalized interaction strategy. For example, the user is watching a comedy movie and the emotion demand reasoning model analyzes the user's emotional state as pleasant and relaxed; based on these emotion demand data, the interactive recommendation content mining step recommends a light-hearted comedy, and a multimedia push scene suitable for viewing, featuring a warm home environment and a comfortable atmosphere, is generated with the generative adversarial network to enhance the user's viewing experience.
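The comprehensive feature weighting of step S51 can be illustrated by a simple weighted sum; the weight values and feature scores below are assumptions chosen for the example only.

```python
# Illustrative weights for combining corrected emotion-pattern features (step S51).
WEIGHTS = {"positive": 1.0, "neutral": 0.0, "negative": -1.0}

def emotion_state_index(feature_scores: dict) -> dict:
    """Weighted combination of emotion features into an emotion state index."""
    intensity = sum(WEIGHTS.get(name, 0.0) * value
                    for name, value in feature_scores.items())
    polarity = ("positive" if intensity > 0 else
                "negative" if intensity < 0 else "neutral")
    return {"intensity": round(intensity, 2), "polarity": polarity}

print(emotion_state_index({"positive": 0.8, "negative": 0.1, "neutral": 0.3}))
# {'intensity': 0.7, 'polarity': 'positive'}
```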
The invention also provides a processing system of the multimedia scene interaction data, which executes the processing method of the multimedia scene interaction data, and the processing system of the multimedia scene interaction data comprises the following steps:
The asynchronous data stream acquisition module is used for carrying out event-driven model selection on the intelligent terminal equipment to obtain event-driven model data; real-time response of the acquisition event is carried out according to the event-driven model data, so that an asynchronous data stream acquisition strategy is obtained; the method comprises the steps that an asynchronous data stream acquisition strategy is applied to intelligent terminal equipment, asynchronous interaction data acquisition is conducted on a user based on preset initial interaction task data, and initial multimedia interaction data are obtained;
The multimedia deconstructing module is used for carrying out asynchronous deconstructing processing on the initial multimedia interaction data to respectively obtain interaction environment tag data, user interaction state data and fusion emotion perception data;
The interactive emotion extraction module is used for extracting emotion characteristics according to the fused emotion perception data to respectively obtain text emotion type data and voice emotion data; carrying out emotion pattern analysis through text emotion type data and voice emotion data to generate corrected emotion pattern data;
The interaction strategy intelligent construction module is used for carrying out interaction environment identification according to the interaction environment label data to generate interaction environment data; performing interactive psychological analysis on the user interaction state data through correcting the emotion pattern data to obtain user interaction psychological data; performing intelligent personalized interaction strategy matching on the user interaction psychological data through the interaction environment data to generate an intelligent personalized interaction strategy;
The interactive scene generation module is used for carrying out context demand analysis according to the initial multimedia interactive data to obtain user interactive demand data; carrying out dynamic scenario demand reasoning according to the user interaction demand data to generate dynamic emotion demand data; and generating a multimedia pushing scene according to the dynamic emotion demand data, and performing personalized pushing according to the intelligent personalized interaction strategy, so as to obtain intelligent pushing scene data.
The method has the beneficial effects that in the event processing of the intelligent terminal equipment, the selection of the proper event driving model is important to the improvement of the response speed of the system and the utilization efficiency of resources. By responding to the event in real time and generating an asynchronous data stream acquisition strategy, the system can ensure the capturing and processing of the event to be effective in time, thereby realizing the effective management and acquisition of the equipment data. This not only enhances the scalability and flexibility of the system, but also enables it to accommodate variations in different scenarios and requirements. The asynchronous deconstructing process of the initial multimedia interactive data is beneficial to effectively decomposing and processing complex multimedia data, and extracting key information and characteristics in the data. In addition, by extracting the interactive environment tag data, the system can know the behaviors and preferences of the user in different environments, and data support is provided for personalized recommendation and environment adaptability adjustment. The data analysis improves the perception capability of the system to the user environment, so that the service is more personalized and fits the actual situation. And the fusion and analysis of emotion perception data, in particular to the acquisition of text emotion type data and voice emotion data, so that the system can more comprehensively understand the emotion state of a user. The method not only improves the accuracy and quality of emotion interaction, but also helps the system to identify the emotion pattern and trend of the user and understand the emotion expression mode and change rule of the user under different situations. Emotion pattern analysis plays an important role in personalizing user experience, improving interaction effects and user satisfaction. The system can accurately know the environmental characteristics of the current user, such as workplaces, home environments, outdoors and the like, by identifying the interactive environment tag data, and intelligent interactive service adjustment is performed according to different characteristics of the environments. By combining the interactive environment data and the user interactive psychological data, the system can intelligently match the interactive strategy suitable for the current environment and the user psychological state, and personalized interactive service is realized. Finally, based on the dynamic emotion demand data, the system can generate multimedia push scenes for different users, including text, pictures, videos and other forms. The intelligent personalized interaction strategy enables the system to push according to personalized requirements and preferences of users, and promotes correlation and attraction of push content, so that user experience is personalized better, and interaction effect and user satisfaction are improved. The comprehensive application of the strategies and analysis tools provides powerful support for event processing and user interaction on intelligent terminal equipment, so that the system can serve users more intelligently and efficiently.
The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
The foregoing is only a specific embodiment of the invention to enable those skilled in the art to understand or practice the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. The processing method of the multimedia scene interaction data is characterized by being applied to intelligent terminal equipment, and comprises the following steps of:
Step S1: selecting an event-driven model of the intelligent terminal equipment to obtain event-driven model data; real-time response of the acquisition event is carried out according to the event-driven model data, so that an asynchronous data stream acquisition strategy is generated; the method comprises the steps that an asynchronous data stream acquisition strategy is applied to intelligent terminal equipment, asynchronous interaction data acquisition is conducted on a user based on preset initial interaction task data, and initial multimedia interaction data are obtained;
step S2: asynchronous deconstructing is carried out on the initial multimedia interaction data to respectively obtain interaction environment tag data, user interaction state data and fusion emotion perception data;
Step S3: extracting emotion characteristics according to the fused emotion perception data to respectively obtain text emotion type data and voice emotion data; carrying out emotion pattern analysis through text emotion type data and voice emotion data to generate corrected emotion pattern data;
Step S4: performing interactive environment recognition according to the interactive environment label data to generate interactive environment data; performing interactive psychological analysis on the user interaction state data through correcting the emotion pattern data to obtain user interaction psychological data; performing intelligent personalized interaction strategy matching on the user interaction psychological data through the interaction environment data to generate an intelligent personalized interaction strategy;
Step S5: carrying out context demand analysis according to the initial multimedia interaction data to obtain user interaction demand data; carrying out dynamic scenario demand reasoning according to the user interaction demand data to generate dynamic emotion demand data; and generating a multimedia pushing scene according to the dynamic emotion demand data, and performing personalized pushing according to the intelligent personalized interaction strategy, so as to obtain intelligent pushing scene data.
2. The method for processing multimedia scene interaction data according to claim 1, wherein the step S1 comprises the steps of:
Step S11: defining an acquisition data stream of the intelligent terminal equipment to generate acquisition data stream data;
Step S12: performing relay role design according to the acquired data stream data to obtain relay role data; asynchronous logic hierarchical division is carried out on the collected data stream data through the relay role data, and asynchronous data stream data are generated;
step S13: selecting an event-driven model according to the asynchronous data stream data to obtain event-driven model data;
Step S14: based on preset event triggering condition data, utilizing event-driven model data to conduct real-time response of acquisition events, and conducting multimedia data format standardization processing, so that an asynchronous data stream acquisition strategy is obtained;
Step S15: and applying an asynchronous data stream acquisition strategy to the intelligent terminal equipment, and carrying out asynchronous interaction data acquisition on the user based on preset initial interaction task data to obtain initial multimedia interaction data.
3. The method for processing multimedia scene interaction data according to claim 2, wherein the fused emotion perception data includes audio spectrum data, emotion text vocabulary data and interactive object expression feature data, and step S2 includes the steps of:
Step S21: dividing data types of the initial multimedia interaction data to respectively obtain initial interaction video data, initial interaction audio data, initial interaction text data and user interaction operation data;
Step S22: extracting interactive video features of the initial interactive video data to respectively obtain key frame data, interactive object expression feature data and interactive environment label data;
Step S23: performing user interaction state analysis on the user interaction operation data through the key frame data to generate user interaction state data;
Step S24: carrying out emotion vocabulary recognition on the initial interactive text data to generate emotion text vocabulary data;
Step S25: and performing audio spectrum processing according to the initial interactive audio data to generate the audio spectrum data.
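A loose sketch of the asynchronous deconstruction in steps S21-S25, assuming the four modalities can be fanned out to worker threads and that the per-modality extractors are stubs; every helper name and return shape here is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def extract_video_features(video):  # stand-in for step S22
    return {"key_frames": [], "expressions": [], "env_tags": []}

def recognize_emotion_words(text):  # stand-in for step S24
    return [w for w in text.split() if w in {"happy", "sad", "angry"}]

def audio_spectrum(audio):          # stand-in for step S25
    return {"spectrum": []}

def interaction_state(ops):         # stand-in for step S23
    return {"events": ops}

def deconstruct(raw: dict) -> dict:
    """Step S21: split by data type, then process the modalities concurrently."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {
            "video": pool.submit(extract_video_features, raw.get("video")),
            "text": pool.submit(recognize_emotion_words, raw.get("text", "")),
            "audio": pool.submit(audio_spectrum, raw.get("audio")),
            "ops": pool.submit(interaction_state, raw.get("operations", [])),
        }
        return {k: f.result() for k, f in futures.items()}

print(deconstruct({"text": "happy user", "operations": [{"type": "tap"}]}))
```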
4. The method for processing multimedia scene interaction data according to claim 3, wherein step S22 comprises the steps of:
Step S221: performing video frame sequence processing on the initial interactive video data through preset frame sampling rate data to generate initial video frame sequence data; constructing an initial picture structure according to the initial video frame sequence data to generate initial frame picture structure data;
Step S222: performing frame difference calculation according to the initial video frame sequence data to generate frame difference data;
step S223: performing optical flow field calculation on the initial video frame sequence data through the frame differential data to obtain optical flow field data;
Step S224: according to the optical flow field data, carrying out feature separation on the moving object to respectively obtain expression feature data of the interactive object and interactive environment label data;
step S225: setting the inter-frame association strength of the initial video frame sequence data through the optical flow field data to obtain inter-frame association strength data; carrying out graph edge weight distribution on the initial frame graph structure data through the inter-frame association strength data to obtain video frame graph structure data;
Step S226: and carrying out graph centrality analysis on the video frame graph structure data, extracting key frames, and generating key frame data.
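Steps S221-S226 read as a graph-based key-frame pipeline. The sketch below approximates it with OpenCV frame differencing and Farneback optical flow feeding the edge weights of a frame graph, with PageRank standing in for the unspecified centrality measure; the parameter values and the association-strength formula are assumptions.

```python
import cv2
import networkx as nx
import numpy as np

def extract_key_frames(path: str, sample_rate: int = 5, top_k: int = 3):
    """Sample frames, weight inter-frame edges by motion, keep the most central frames."""
    cap = cv2.VideoCapture(path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_rate == 0:                               # S221: frame sampling
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
        idx += 1
    cap.release()
    if not frames:
        return []

    graph = nx.Graph()
    graph.add_nodes_from(range(len(frames)))                     # S221: initial graph structure
    for i in range(len(frames) - 1):
        diff = cv2.absdiff(frames[i], frames[i + 1])             # S222: frame difference
        flow = cv2.calcOpticalFlowFarneback(frames[i], frames[i + 1],
                                            None, 0.5, 3, 15, 3, 5, 1.2, 0)  # S223
        motion = float(np.linalg.norm(flow, axis=2).mean())
        strength = 1.0 / (1.0 + motion + float(diff.mean()))     # S225: association strength
        graph.add_edge(i, i + 1, weight=strength)

    centrality = nx.pagerank(graph, weight="weight")             # S226: graph centrality
    ranked = sorted(centrality, key=centrality.get, reverse=True)[:top_k]
    return [frames[i] for i in sorted(ranked)]
```

PageRank is used here only because it is a readily available weighted centrality; any degree-, eigenvector- or betweenness-based measure would fit the claim language equally well.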
5. The method for processing multimedia scene interaction data according to claim 4, wherein step S224 includes the steps of:
Step S2241: moving object separation is carried out according to the optical flow field data, and moving object data and moving background data are respectively obtained;
Step S2242: object region refinement processing is carried out on the moving object data to generate object contour data;
step S2243: performing face recognition according to the object contour data to obtain face recognition data;
Step S2244: carrying out expression intensity feature analysis on the face recognition data to generate interactive object expression feature data;
step S2245: performing background differential calculation on the motion background data to generate background environment differential data;
Step S2246: performing light intensity analysis on the background environment difference data to generate environment light intensity data;
step S2247: and carrying out color distribution detection according to the ambient light intensity data, and carrying out background object recognition so as to obtain interactive environment label data.
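One possible reading of steps S2241-S2247, collapsing several sub-steps: optical-flow thresholding separates motion from background, OpenCV's stock Haar cascade stands in for the face recognizer, face-crop contrast is a crude placeholder for expression intensity, and background brightness plus dominant hue yield the environment tags. None of these concrete choices is specified by the claim.

```python
import cv2
import numpy as np

# Haar cascade shipped with OpenCV; a stand-in face detector for step S2243.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def separate_and_tag(prev_gray, cur_gray, frame_bgr, flow_threshold=1.0):
    # S2241: threshold optical-flow magnitude into moving-object vs. background masks.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, cur_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    moving_mask = (np.linalg.norm(flow, axis=2) > flow_threshold).astype(np.uint8)
    background_mask = 1 - moving_mask

    # S2243-S2244: detect faces in the moving region; use local contrast of the
    # face crop as a crude placeholder for expression intensity.
    faces = face_cascade.detectMultiScale(cur_gray * moving_mask)
    intensities = [float(cur_gray[y:y + h, x:x + w].std()) for (x, y, w, h) in faces]

    # S2245-S2247: light intensity and dominant hue of the background region.
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    bg_pixels = hsv[..., 2][background_mask.astype(bool)]
    light = float(bg_pixels.mean()) if bg_pixels.size else 0.0
    hue_hist = cv2.calcHist([hsv], [0], (background_mask * 255).astype(np.uint8),
                            [12], [0, 180])
    dominant_hue_bin = int(hue_hist.argmax())   # bucket index on OpenCV's 0-180 hue scale

    return {"expression_intensity": intensities,
            "environment_tags": {"light": "bright" if light > 128 else "dim",
                                 "dominant_hue_bin": dominant_hue_bin}}
```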
6. The method for processing multimedia scene interaction data according to claim 3, wherein step S23 comprises the steps of:
step S231: performing time axis association on the user interaction operation data through the key frame data to generate associated interaction event data;
Step S232: user interaction hot spot detection is carried out according to the associated interaction event data, so that key interaction event data are obtained;
Step S233: classifying the interaction types of the key interaction event data to generate interaction classification data;
Step S234: performing interactive response time analysis on the key interactive event data through the interactive action classification data based on a preset standard interactive action response threshold value to generate interactive response time data;
Step S235: and marking the user interaction state in real time through the interaction reaction time data to generate user interaction state data.
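Steps S231-S235 amount to aligning operations with the preceding key frame and labelling the user state from response times; below is a small sketch, with the 1.5 s threshold and the engaged/distracted labels as assumed placeholders rather than values from the patent.

```python
from statistics import mean

STANDARD_RESPONSE_THRESHOLD = 1.5   # seconds; an assumed preset value

def label_user_state(key_frames, operations):
    """S231: associate each operation with the nearest earlier key frame,
    then S234-S235: derive response times and a real-time state label."""
    events = []
    for op in operations:
        preceding = [kf for kf in key_frames if kf["ts"] <= op["ts"]]
        if not preceding:
            continue
        stimulus = max(preceding, key=lambda kf: kf["ts"])
        events.append({"type": op["type"], "response_time": op["ts"] - stimulus["ts"]})
    if not events:
        return {"state": "idle", "events": []}
    avg = mean(e["response_time"] for e in events)
    state = "engaged" if avg <= STANDARD_RESPONSE_THRESHOLD else "distracted"
    return {"state": state, "avg_response_time": avg, "events": events}

key_frames = [{"ts": 0.0}, {"ts": 4.0}]
operations = [{"type": "tap", "ts": 0.8}, {"type": "swipe", "ts": 6.2}]
print(label_user_state(key_frames, operations))
```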
7. The method for processing multimedia scene interaction data according to claim 3, wherein step S3 comprises the steps of:
Step S31: carrying out emotion word intensity calculation on emotion text vocabulary data to generate emotion word intensity data; carrying out emotion type recognition according to emotion word intensity data to generate text emotion type data;
step S32: voice characteristic extraction is carried out on the audio frequency spectrum data through the Mel frequency cepstrum coefficient, and voice characteristic data are generated;
Step S33: carrying out audio emotion recognition on voice characteristic data by using a preset support vector machine model so as to obtain voice emotion data;
Step S34: carrying out emotion pattern analysis according to emotion text type data and voice emotion data to generate emotion pattern data;
Step S35: and carrying out emotion mode correction processing on the emotion mode data by using the expression characteristic data of the interactive object, and generating corrected emotion mode data.
8. The method for processing multimedia scene interaction data according to claim 1, wherein the step S4 comprises the steps of:
Step S41: performing interactive environment recognition according to the interactive environment label data to generate interactive environment data;
Step S42: performing interactive psychological analysis on the user interaction state data through correcting the emotion pattern data to obtain user interaction psychological data;
step S43: performing intelligent interaction strategy matching on the user interaction psychological data through the interaction environment data to generate an intelligent interaction strategy;
step S44: acquiring historical user scenario interaction data;
step S45: and performing personalized adjustment on the intelligent interaction strategy by using the historical user scenario interaction data to generate an intelligent personalized interaction strategy.
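Steps S43-S45 can be pictured as a rule table keyed by interaction environment and user psychology, adjusted by the user's historical scenario interactions; the rule table, the labels, and the history format below are invented for illustration only.

```python
from collections import Counter

# Hypothetical rule table mapping (environment, psychology) to a base strategy.
BASE_STRATEGIES = {
    ("quiet", "focused"): "long_form_content",
    ("quiet", "stressed"): "relaxation_scene",
    ("noisy", "focused"): "short_clips_with_subtitles",
    ("noisy", "stressed"): "calming_audio_first",
}

def match_strategy(environment: str, psychology: str, history: list) -> dict:
    """S43: rule-based matching, then S44-S45: personalisation from history."""
    base = BASE_STRATEGIES.get((environment, psychology), "default_push")
    favourite = None
    if history:
        favourite = Counter(h["content_type"] for h in history).most_common(1)[0][0]
    return {"strategy": base, "preferred_content_type": favourite}

history = [{"content_type": "music"}, {"content_type": "music"}, {"content_type": "video"}]
print(match_strategy("quiet", "stressed", history))
```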
9. The method for processing multimedia scene interaction data according to claim 1, wherein the step S5 comprises the steps of:
Step S51: carrying out comprehensive feature weighting processing on the corrected emotion pattern data, and carrying out emotion state index calculation to generate emotion state index data;
step S52: carrying out context demand analysis according to the initial multimedia interaction data to obtain user interaction demand data;
Step S53: constructing an emotion demand reasoning model from the emotion state index data based on a convolutional neural network model;
Step S54: transmitting the user interaction demand data to the emotion demand reasoning model for dynamic scenario demand reasoning, and generating dynamic emotion demand data;
Step S55: performing interactive recommendation content mining according to the dynamic emotion demand data to generate interactive recommendation content data;
Step S56: and generating a multimedia pushing scene for the interactive recommendation content data by using a generative adversarial network, and carrying out personalized pushing according to the intelligent personalized interaction strategy, so as to obtain intelligent pushing scene data.
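Steps S51-S54 combine a weighted emotion state index with a convolutional reasoning model. The sketch shows the weighted fusion and an untrained 1-D CNN as a stand-in for the emotion demand reasoning model; the generative adversarial scene generation of step S56 is omitted, and the weights, dimensions, and demand categories are assumptions.

```python
import torch
import torch.nn as nn

# Step S51: weighted fusion of the corrected emotion signals into one index.
FEATURE_WEIGHTS = torch.tensor([0.4, 0.35, 0.25])   # text, voice, expression (assumed weights)

def emotion_state_index(text_score, voice_score, expression_score) -> float:
    features = torch.tensor([text_score, voice_score, expression_score])
    return float((features * FEATURE_WEIGHTS).sum())

# Steps S53-S54: an untrained 1-D CNN standing in for the emotion demand reasoning model.
class EmotionDemandNet(nn.Module):
    def __init__(self, n_demands: int = 4):
        super().__init__()
        self.conv = nn.Conv1d(1, 8, kernel_size=3, padding=1)
        self.head = nn.Linear(8, n_demands)

    def forward(self, x):                            # x: (batch, sequence_length)
        h = torch.relu(self.conv(x.unsqueeze(1)))    # (batch, 8, sequence_length)
        return self.head(h.mean(dim=2))              # pooled over time

model = EmotionDemandNet()
index_sequence = torch.tensor([[0.2, 0.5, 0.7, 0.6, 0.4]])   # emotion index over recent turns
demand_logits = model(index_sequence)
print("dynamic demand distribution:", torch.softmax(demand_logits, dim=1))
```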
10. A processing system for multimedia scene interaction data, characterized in that the system is configured to execute the processing method for multimedia scene interaction data according to claim 1, the processing system for multimedia scene interaction data comprising:
The asynchronous data stream acquisition module is used for carrying out event-driven model selection on the intelligent terminal equipment to obtain event-driven model data; carrying out real-time response to acquisition events according to the event-driven model data, so as to obtain an asynchronous data stream acquisition strategy; and applying the asynchronous data stream acquisition strategy to the intelligent terminal equipment and performing asynchronous interaction data acquisition on the user based on preset initial interaction task data to obtain initial multimedia interaction data;
The multimedia deconstructing module is used for carrying out asynchronous deconstructing processing on the initial multimedia interaction data to respectively obtain interaction environment tag data, user interaction state data and fusion emotion perception data;
The interactive emotion extraction module is used for extracting emotion characteristics according to the fused emotion perception data to respectively obtain text emotion type data and voice emotion data; carrying out emotion pattern analysis through text emotion type data and voice emotion data to generate corrected emotion pattern data;
The interaction strategy intelligent construction module is used for carrying out interaction environment identification according to the interaction environment label data to generate interaction environment data; performing interactive psychological analysis on the user interaction state data through correcting the emotion pattern data to obtain user interaction psychological data; performing intelligent personalized interaction strategy matching on the user interaction psychological data through the interaction environment data to generate an intelligent personalized interaction strategy;
The interactive scene generation module is used for carrying out context demand analysis according to the initial multimedia interactive data to obtain user interactive demand data; carrying out dynamic scenario demand reasoning according to the user interaction demand data to generate dynamic emotion demand data; and generating a multimedia pushing scene according to the dynamic emotion demand data, and performing personalized pushing according to the intelligent personalized interaction strategy, so as to obtain intelligent pushing scene data.
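Finally, the five modules of claim 10 can be wired together as plain classes mirroring steps S1-S5; the class names and data shapes are illustrative only, and each method body is a stub where the sketches above would slot in.

```python
class AsyncAcquisitionModule:
    def collect(self, task):                 # step S1
        return {"video": None, "audio": None, "text": "", "operations": []}

class DeconstructionModule:
    def split(self, raw):                    # step S2
        return {"env_tags": {}, "user_state": {}, "fused_emotion": {}}

class EmotionExtractionModule:
    def corrected_pattern(self, fused):      # step S3
        return "neutral"

class StrategyBuilderModule:
    def build(self, env_tags, user_state, pattern):   # step S4
        return {"strategy": "default_push"}

class SceneGenerationModule:
    def push(self, raw, strategy):           # step S5
        return {"scene": "generated", **strategy}

class MultimediaInteractionSystem:
    """Wires the five modules of claim 10 together in the order of claim 1."""
    def __init__(self):
        self.acquire = AsyncAcquisitionModule()
        self.deconstruct = DeconstructionModule()
        self.emotion = EmotionExtractionModule()
        self.strategy = StrategyBuilderModule()
        self.scene = SceneGenerationModule()

    def run(self, task):
        raw = self.acquire.collect(task)
        parts = self.deconstruct.split(raw)
        pattern = self.emotion.corrected_pattern(parts["fused_emotion"])
        plan = self.strategy.build(parts["env_tags"], parts["user_state"], pattern)
        return self.scene.push(raw, plan)

print(MultimediaInteractionSystem().run({"task": "demo"}))
```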