CN118227894A - Government affair item dialogue recommendation method based on multidimensional vector fusion

Info

Publication number: CN118227894A
Authority: CN (China)
Prior art keywords: text, feature, vector, sequence, features
Prior art date
Legal status: Pending
Application number: CN202410413490.1A
Other languages: Chinese (zh)
Inventors: 张煇, 杨勇, 赵宝金, 冯向阳
Current Assignee: Changhe Information Co ltd; Beijing Changhe Digital Intelligence Technology Co ltd
Original Assignee: Changhe Information Co ltd; Beijing Changhe Digital Intelligence Technology Co ltd
Application filed by Changhe Information Co ltd, Beijing Changhe Digital Intelligence Technology Co ltd
Priority to CN202410413490.1A
Publication of CN118227894A

Abstract

The application discloses a government affair item dialogue recommendation method based on multidimensional vector fusion, which relates to the technical field of government affair recommendation and comprises the following steps: collecting multi-modal data including user text, user behavior logs, images and audio signals; extracting text features with a natural language processing method; extracting user behavior features with a statistical method; extracting image features with a computer vision method; extracting audio features with a digital signal processing method; normalizing the extracted features; adopting a multi-view representation learning algorithm, taking the normalized features of different types as different views, and acquiring the weight coefficients of the different views through an objective function; obtaining a fused multi-modal feature representation according to the weight coefficients; and constructing a neural network model that outputs a government affair item recommendation result. Aiming at the problem of low accuracy in understanding user intention in prior-art government service dialogue systems, the application improves the recommendation accuracy of government affairs.

Description

Government affair item dialogue recommendation method based on multidimensional vector fusion
Technical Field
The application relates to the technical field of government affair recommendation, in particular to a government affair item dialogue recommendation method based on multidimensional vector fusion.
Background
With the development of the Internet, various government service portals are gradually realizing "one-net-through" and "one-window acceptance", providing convenient services through online platforms, sparing the public tedious offline visits and greatly improving the efficiency of obtaining government services. At the same time, the rise of intelligent dialogue robots creates new possibilities for government service dialogue. However, current dialogue robots cannot accurately understand the service requirements of users, which directly affects the quality of government consultation services. Therefore, improving a dialogue system's ability to understand user intention and realizing personalized government service recommendation has become the current direction of technical development.
In existing government service dialogue systems, the main problem in user intention understanding is the limitation of single-feature extraction methods, which cannot fully consider user behavior and expression across multiple dimensions. For example, text-based methods may ignore the visual and speech features of a user, thereby limiting understanding of the user's actual intent. In addition, due to the lack of a comprehensive feature fusion mechanism, existing systems may fail to fully mine the latent relevance of information when processing multi-modal data, resulting in low accuracy of recommendation results.
In the related art, for example, Chinese patent document CN116595161A provides a government affair digital scene item recommendation method, apparatus, device and storage medium, wherein the method includes: acquiring a search text for government scene matters; determining a first recommended scene item from each candidate government scene item based on the semantic similarity between the subject words of the search text and the subject words of each candidate government scene item; determining hit scene matters matching the search text from the candidate government scene matters, counting the transaction-subject frequency of the candidate government scene matters based on the transaction records of the target transaction subjects of the hit scene matters, and determining second recommended scene matters from the candidate government scene matters based on the transaction-subject frequency; and determining the recommended government scene item corresponding to the search text based on the first and second recommended scene items. But this approach relies primarily on text for matching, resulting in an understanding of the user's intent that is neither accurate nor comprehensive.
Disclosure of Invention
1. Technical problem to be solved
Aiming at the problem of low accuracy in understanding user intention in prior-art government service dialogue systems, the application provides a government affair item dialogue recommendation method based on multidimensional vector fusion, which accurately captures user intention and improves the recommendation accuracy of government affairs by comprehensively analyzing the various interaction behaviors and contents of users.
2. Technical solution
The aim of the application is achieved by the following technical scheme.
The embodiment of the specification provides a government affair item dialogue recommendation method based on multidimensional vector fusion, comprising the following stages.
Data input: multi-modal data containing user text, user behavior logs, images and audio signals is collected as the raw input of the system.
Data preprocessing: the user text undergoes preprocessing operations such as word segmentation and part-of-speech tagging to obtain structured text data; the image undergoes preprocessing operations such as denoising and orientation correction to obtain standardized image data; the audio signal undergoes preprocessing operations such as noise elimination, endpoint detection and framing to obtain normalized audio data.
Feature extraction: text features are extracted from the preprocessed text data with methods such as named entity recognition and keyword extraction; image features are extracted from the preprocessed image data with methods such as optical character recognition and word vector representation; audio features are extracted from the preprocessed audio data with methods such as MFCC and topological graph clustering; user behavior features are extracted from the user behavior log with methods such as statistical analysis and intention adjustment.
Feature normalization: the extracted text, image, audio and user behavior features are processed with different normalization strategies so that their scales and distributions become more consistent.
Multi-view fusion: the normalized features of different types are assembled into a multi-view feature matrix; the weight coefficient of each view is learned through a multi-view representation learning algorithm with an alternating optimization strategy; the features of the different views are weighted and fused with the learned coefficients into the final fused feature vector.
Deep learning modeling: the fused feature vector is input into a predefined neural network model; through forward and backward propagation, the nonlinear mapping between features and government affairs is learned; the trained neural network model performs government affair recommendation and prediction.
Recommendation output: the output of the neural network model is post-processed and visualized to generate the final government affair item recommendation list presented to the user.
The entire data flow can be summarized as: multi-modal data input, data preprocessing, feature extraction, feature normalization, multi-view fusion, deep learning modeling, and recommendation output. As the data flows between the modules it is processed and refined layer by layer, finally forming a feature representation with stronger relevance and semantics that drives the downstream recommendation task.
The embodiment of the specification provides a government affair item dialogue recommendation method based on multidimensional vector fusion, comprising: collecting multi-modal data including user text, user behavior logs, images and audio signals; extracting text features from the user text with a natural language processing method; extracting user behavior features from the user behavior log with a statistical method; extracting image features from the images with a computer vision method; extracting audio features from the audio signals with a digital signal processing method; normalizing the extracted text features, user behavior features, image features and audio features respectively; adopting a multi-view representation learning algorithm, taking the normalized features of different types as different views, and acquiring the weight coefficients of the different views through an objective function; multiplying the feature vectors of the different views with the corresponding weight coefficients to obtain a fused multi-modal feature representation; and constructing a neural network model that takes the fused multi-modal feature representation as input and outputs the government affair item recommendation result.
Among them, natural language processing is an important branch of artificial intelligence, intended to allow computers to understand, generate and process human language. NLP methods draw on computer science and linguistic theory to develop algorithms and models for analyzing and representing natural language text at multiple levels, such as grammar, semantics and emotion. In the present application, natural language processing methods are used to extract key features from user text. A concrete technical route may be: preprocess the text through word segmentation, part-of-speech tagging and similar methods; extract multi-granularity features such as bag-of-words, TF-IDF, topic distribution and named entities; then reduce the dimension of these features through semantic representation methods such as word vectors and sentence vectors, finally generating a structured low-dimensional text feature vector that facilitates the subsequent feature fusion.
Computer vision is a discipline in which a computer is used to simulate human visual function, allowing a machine to "see" and understand the world from images or video. It integrates theories and techniques from several fields, such as image processing, pattern recognition and machine learning, and aims at endowing a computer with the intelligence to analyze, recognize and track visual objects. In the present application, computer vision is used to extract key features from image data. A feasible technical path is: preprocess the image by size normalization, denoising and the like, then extract multi-scale, multi-granularity image features with classical hand-crafted features (such as SIFT and HOG) or a pre-trained CNN model (such as VGG16 and ResNet) to obtain a high-dimensional image description vector. To highlight semantic information related to government scenes, fine-tuning and enhancement can be performed on the extracted CNN features in combination with tasks such as scene classification and OCR. Finally, the high-dimensional image features are reduced in dimension and normalized to generate a compact image feature representation.
Among them, digital signal processing is the discipline of representing, transforming, extracting, enhancing, filtering and estimating information with a digital computer or dedicated device. The processing objects can be various digitized signals such as voice, image, video and radar. DSP methods design efficient and practical signal processing algorithms with mathematical tools such as signals and systems, probability theory and optimization theory, and are used to analyze the time-frequency characteristics of a signal, remove noise interference and recover the essential content of the signal. In the present application, digital signal processing is used to extract speech and speaker characteristics from audio data. The technical path may be: preprocess the raw audio by noise reduction, endpoint detection and the like to obtain clean speech segments; then extract features with classical speech features such as MFCC, PLP and Fbank, or with a speech deep learning model (such as Deep Speaker), capturing acoustic properties of the speech signal such as spectrum, pitch and formants; then use voiceprint recognition and similar techniques to separate the speaker identity from the speech. Finally, the speech and speaker features undergo feature selection, dimension reduction and normalization, and a standard audio feature vector is output.
Further, extracting text features from the user text by using a natural language processing method includes: performing word segmentation on the input user text to obtain a word sequence {w_i}; performing part-of-speech tagging on the segmented word sequence {w_i} to obtain a part-of-speech sequence {t_i}; constructing a named entity recognition model based on a conditional random field (CRF); taking the word sequence {w_i} and the corresponding part-of-speech sequence {t_i} as the observation sequence and feeding it to the constructed CRF named entity recognition model; constructing a feature template with the vocabulary and phrases of the government domain dictionary as input features; training the CRF named entity recognition model with an annotated government text dataset; inputting the observation sequence into the trained CRF named entity recognition model and performing sequence labeling with the Viterbi algorithm; and collecting the words labeled as department entities and regulation entities from the labeling result, yielding named entity features as the first class of text features.
The conditional random field (CRF) is a sequence labeling model commonly used in natural language processing. Sequence labeling means predicting the corresponding tag sequence (such as parts of speech or named entity categories) for a given observation sequence (such as a word sequence). CRF is a discriminative undirected graphical model and can be viewed as an extension of the maximum entropy Markov model. Unlike generative models (such as hidden Markov models), discriminative models model the conditional probability P(y|x) directly, where x is the observation sequence and y is the tag sequence. In the present application, CRF is used to identify department- and regulation-related named entities in user text. The input observation sequence is the segmented word sequence together with its parts of speech, and the output tag sequence uses the BMEO format (B marks an entity beginning, M an entity middle, E an entity end, O a non-entity token). To improve recognition accuracy, dictionary features from the government domain are integrated; this prior knowledge guides the CRF model to pay more attention to domain keywords and reduces interference from irrelevant words.
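As an illustration only, a minimal sketch of such a dictionary-augmented CRF tagger can be written with the third-party sklearn-crfsuite package (not named in the patent). The feature template, the tiny training sample, the dictionary entries and the tags are illustrative assumptions, not the patent's data.

```python
# Hedged sketch: CRF NER with a government-domain dictionary feature.
import sklearn_crfsuite

GOV_DICT = {"社保局", "居住证"}  # assumed government-domain dictionary entries

def word2features(words, tags, i):
    return {
        "word": words[i],
        "pos": tags[i],                       # part-of-speech feature
        "in_gov_dict": words[i] in GOV_DICT,  # domain-dictionary feature template
        "prev_word": words[i - 1] if i > 0 else "<BOS>",
    }

# One labeled sentence: "我 想 去 社保局" with illustrative BMEO-style tags
X = [[word2features(["我", "想", "去", "社保局"], ["r", "v", "v", "n"], i)
      for i in range(4)]]
y = [["O", "O", "O", "B"]]  # "社保局" begins a department entity

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)          # train on annotated government text
print(crf.predict(X))  # sequence labeling (Viterbi decoding) of the tags
```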
The Viterbi algorithm is a dynamic programming algorithm for hidden Markov model (HMM) decoding. Its aim is to solve for the optimal state sequence: given the model parameters and an observation sequence, infer the most probable hidden state sequence. The Viterbi algorithm is a search under the maximum a posteriori probability criterion that uses the idea of dynamic programming to avoid an exhaustive search. In the application, the state set of the HMM is Q = {q_1, q_2, ..., q_N}, the observation set is V = {v_1, v_2, ..., v_M}, the state transition probability matrix is A, the observation probability matrix is B, and the initial state probability vector is π.
Given an observation sequence O = {o_1, o_2, ..., o_T} of length T, the Viterbi algorithm finds the optimal state sequence I* = argmax_I P(I|O) = argmax_I P(O, I) = argmax_I P(O|I)·P(I). The Bayes formula and the Markov assumption are used here: P(O|I) is the likelihood that a given state sequence generates the observation sequence, and P(I) is the prior probability of the state sequence itself. To solve this maximization efficiently, the Viterbi algorithm defines two auxiliary variables: δ(t, i) = max P(o_1, o_2, ..., o_t, i_1, i_2, ..., i_t = q_i), the probability of the locally optimal path ending in state q_i at time t, and ψ(t, i), the optimal predecessor state at time t−1 of that path.
Using the dynamic programming principle, these two variables can be computed recursively: δ(1, i) = π_i · b_i(o_1); δ(t, i) = max_j [δ(t−1, j) · a_ji] · b_i(o_t); ψ(t, i) = argmax_j [δ(t−1, j) · a_ji]. δ and ψ are computed recursively from left to right until the last position T of the observation sequence. Then δ(T, i) is the probability of the optimal path ending in state q_i, and ψ(T, i) is the penultimate node of that path. Starting from the final state, all nodes of the optimal state sequence I* can be obtained by backtracking from right to left: i_T* = argmax_i δ(T, i); i_t* = ψ(t+1, i_{t+1}*), for t = T−1, T−2, ..., 1.
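A self-contained NumPy implementation of exactly this recursion follows; the toy HMM parameters and observation sequence are illustrative assumptions.

```python
# Hedged sketch: Viterbi decoding with the delta/psi recursion above.
import numpy as np

def viterbi(pi, A, B, obs):
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))           # delta[t, i]: best path prob ending in i
    psi = np.zeros((T, N), dtype=int)  # psi[t, i]: best predecessor state
    delta[0] = pi * B[:, obs[0]]       # delta(1, i) = pi_i * b_i(o_1)
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A   # delta(t-1, j) * a_ji
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]         # i_T* = argmax_i delta(T, i)
    for t in range(T - 1, 0, -1):            # right-to-left backtracking via psi
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

pi = np.array([0.6, 0.4])               # initial state probabilities
A = np.array([[0.7, 0.3], [0.4, 0.6]])  # state transition matrix
B = np.array([[0.5, 0.5], [0.1, 0.9]])  # observation probability matrix
print(viterbi(pi, A, B, obs=[0, 1, 1])) # -> [0, 1, 1]
```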
Further, extracting text features from the user text by using a natural language processing method further includes: acquiring text data in the government domain and segmenting it into a word sequence {c_i}; taking the government domain dictionary as the candidate feature vocabulary, counting the frequency of each candidate word in the word sequence {c_i} and computing the term frequency TF; counting the number of documents containing each word and computing the inverse document frequency IDF; multiplying TF and IDF to obtain the TF-IDF value of each word as its weight; constructing a TF-IDF model from these weights and training it; applying the trained TF-IDF model to the word sequence {w_i} of the input user text to compute each word's weight; setting a weight threshold and selecting words whose weight exceeds the threshold, yielding keyword features as the second class of text features; and concatenating the first and second classes of text features into the final text feature vector tf.
The government domain dictionary is a vocabulary set specifically for texts in the government domain, containing key vocabulary commonly used in the field such as professional terms, organization names, and policy regulations. These words typically carry explicit business semantics and can represent the subject matter of a text. Compared with a general dictionary, a domain dictionary's coverage is more focused and accurate, improving text feature extraction and information extraction. In the present application, the government domain dictionary is mainly used in two ways: first, in the feature template for CRF named entity recognition, guiding the model to focus on domain-relevant entities; second, to screen domain-related keywords as the candidate feature vocabulary of the TF-IDF model. The domain dictionary thus plays a focusing and guiding role, bringing the text features closer to government topics and reducing interference from irrelevant words. Of course, the quality of the domain dictionary largely determines the expressive power of the text features, so considerable manpower and resources are required to construct and update it.
The inverse document frequency (IDF) is a statistical index measuring word importance, often combined with term frequency (Term Frequency, TF) to form TF-IDF weights. The basic idea of IDF is that if a word appears in many documents its discriminative power is low; conversely, if it appears in only a few documents its discriminative power is higher. In the application, the IDF is multiplied by the TF to obtain the word's TF-IDF weight, which serves as the basis for screening keyword features. This weighting integrates a word's local importance (TF) and global importance (IDF) and more accurately reflects its contribution to the text semantics.
TF-IDF is a simple but effective text representation model, widely applied in information retrieval, text mining and other fields. As the name suggests, TF-IDF consists of two parts, term frequency (TF) and inverse document frequency (IDF), and is used to assess the importance of words to a text: the higher a word's frequency in a document and the lower its frequency across all documents of the corpus, the more representative the word is of the document's topic. In the application, the TF and IDF values of each word are first counted on the government-domain corpus to construct a TF-IDF model. The model is then used to vectorize the text input by the user, and keywords with weights above a threshold are selected as text features. The TF-IDF model thus plays a feature selection role: by weighting the raw vocabulary it highlights the importance of domain-related words and filters out a large amount of redundant and noisy words, making the text features more concise and efficient.
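A minimal sketch of this keyword-selection step, using scikit-learn (a library choice not specified by the patent); the pre-segmented corpus, candidate vocabulary, and the 0.2 threshold are illustrative assumptions.

```python
# Hedged sketch: TF-IDF keyword features restricted to a domain vocabulary.
from sklearn.feature_extraction.text import TfidfVectorizer

# Pre-segmented (space-joined) government-domain corpus used to fit IDF weights
corpus = ["办理 社保 转移 业务", "申请 居住证 材料 清单", "社保 缴费 记录 查询"]
candidate_vocab = ["社保", "居住证", "材料", "缴费"]  # domain-dictionary words

# Fixing `vocabulary` realizes the candidate-feature-vocabulary filtering
tfidf = TfidfVectorizer(vocabulary=candidate_vocab)
tfidf.fit(corpus)

user_text = "我 想 查询 社保 缴费 记录"
weights = tfidf.transform([user_text]).toarray()[0]

THRESHOLD = 0.2  # assumed weight threshold
keywords = [w for w, s in zip(candidate_vocab, weights) if s > THRESHOLD]
print(keywords)  # words kept as the second class of text features
```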
Further, extracting user behavior features from the user behavior log by using a statistical method includes: parsing the input user behavior log to obtain the quantity of each kind of material submitted by the user while handling each government service, forming a quantity sequence {m_i}, where each element corresponds to the submitted quantity of one material; acquiring the number n of times the user aborted handling during each government service; counting the total number sum of all materials submitted by the user; for each element m_i of the quantity sequence {m_i}, calculating the submission ratio p_i = m_i / sum of the corresponding material; sorting the submission ratios p_i in descending order to obtain an initial intention ranking sequence q over the government services; adjusting the initial intention ranking q according to the abort count n to obtain the final user behavior feature sequence r: when n = 0, r = q; when n > 0, shifting the top-K elements of q backward by K positions to form the adjusted sequence r; and taking the user behavior feature sequence r as the user behavior feature uf extracted from the user behavior log.
Specifically, q is a service intention ranking generated from the user's behaviors (such as querying, clicking, filling in forms, and submitting applications) during service handling, reflecting the user's preference for different services: the earlier a service is ranked, the more it interests the user. But the user's real intent is manifested not only in explicit behavior; it is also implicit in random behavior, namely the user stopping the current service for some reason during handling and turning to other services. Random behavior typically indicates that the user is not very interested in the current service, or that some obstacle prevented its completion. Thus the random behavior itself also carries important information about user intention. The core of the adjustment scheme is to correct the user intention with the random-behavior count n, obtaining more accurate user behavior features.
Specifically, two cases are distinguished. When n = 0, i.e., the user exhibited no random behavior, q needs no adjustment and is equivalent to the user's real intention. When n > 0, the user exhibited random behavior, and we first consider its potential cause: popular services at the front of the ranking are more prone to random behavior, because they attract many user attempts, yet users may encounter problems during actual handling and stop. To reveal these masked problems, the scheme adopts a smoothing strategy: the first K elements of q are shifted back by K positions, producing the adjusted sequence r. This penalizes services with high random-behavior rates by reducing their weight in the intent ranking while promoting services with low random-behavior rates. This smoothing both corrects the intent deviation caused by random behavior and avoids unduly penalizing services with high random-behavior rates, since they still represent the user's basic preference. The smoothed sequence r reflects the user's real attitude toward the various services more comprehensively and objectively.
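The following self-contained sketch implements exactly this computation; the sample log values and K = 2 are illustrative assumptions.

```python
# Hedged sketch: submission ratios -> intent ranking q -> abort-adjusted r.

def behavior_features(m, n, K=2):
    total = sum(m)                    # sum of all submitted materials
    p = [mi / total for mi in m]      # submission ratio p_i = m_i / sum
    q = sorted(range(len(m)), key=lambda i: p[i], reverse=True)  # ranking q
    if n == 0:
        return q                      # no random behavior: r = q
    # n > 0: shift the first K entries back by K positions (smoothing penalty)
    head, tail = q[:K], q[K:]
    return tail[:K] + head + tail[K:]

m = [5, 2, 8, 1]                      # materials submitted per service
print(behavior_features(m, n=0))      # -> [2, 0, 1, 3]
print(behavior_features(m, n=3))      # -> [1, 3, 2, 0]: top-2 demoted by 2
```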
Further, extracting image features from the image using a computer vision method includes: denoising the input image I with a Gaussian filtering method to obtain a denoised image I′; correcting the orientation of image I′ with a Hough transform method to obtain a preprocessed image I″; feeding the preprocessed image I″ into a pre-trained optical character recognition (OCR) model; locating and recognizing the text regions in image I″ with the OCR model to obtain text content texts; segmenting the text content texts with a word segmentation algorithm to obtain a word sequence words; acquiring the position of each word in the text content texts to obtain a word position sequence positions; converting the word sequence words into a word vector sequence wv; converting the word position sequence positions into a position vector sequence pv; concatenating the word vector sequence wv and the position vector sequence pv into the image feature vector imf = [wv, pv]; and taking the image feature vector imf as the image feature extracted from the input image I.
Gaussian filtering is a common image smoothing and denoising method: a Gaussian kernel slides over the image, each element of the kernel is multiplied by the pixel value it covers, all products are added, and the result is divided by the sum of the kernel elements to obtain the new value of the central pixel. This process is called convolution. The physical meaning of convolution is to replace the original pixel value with a weighted local average, where the Gaussian kernel determines the weights of the different neighborhood positions. In the application, Gaussian filtering is the first step of image preprocessing, aiming to remove noise interference from the input image and provide cleaner, more stable input for subsequent tasks such as orientation correction and character recognition. This embodies one of the fundamental principles of computer vision tasks, namely denoising before recognition. Gaussian filtering is widely applied in image processing thanks to its simplicity, efficiency and adjustability.
The Hough transform is a classical algorithm for detecting regular shapes (such as straight lines and circles) in images, with important applications in tasks such as object recognition and image registration. First, edge detection (such as the Canny operator) is performed on the image to obtain a binarized edge map; then each edge point (x, y) is transformed into the parameter space (ρ, θ), and 1 is added at the corresponding position (ρ, θ) of the parameter space. In this way, collinear points in image space converge on the same (ρ, θ) point in parameter space. Finally, the (ρ, θ) points with the highest accumulated votes are found in the parameter space; they correspond to the lines most likely to exist in image space. In the present application, the Hough transform is used for orientation correction of the image, because the image may be tilted or rotated during photographing or scanning, which hampers subsequent character recognition. With the Hough transform, the main straight lines in the image (such as the borders of a table) can be detected, their deviation angle from the horizontal computed, and the whole image rotated so that it is as horizontal as possible. This preprocessing reduces the difficulty of character recognition and improves recognition accuracy.
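An OpenCV sketch of this preprocessing chain (Gaussian denoising, then Hough-line-based rotation correction) is given below. The kernel size, Canny thresholds, Hough vote count, and the synthetic test image are illustrative assumptions, and the sign convention of the rotation may need adjusting for real scans.

```python
# Hedged sketch: I -> Gaussian denoise -> I' -> Hough deskew -> I''.
import cv2
import numpy as np

def preprocess(image):
    denoised = cv2.GaussianBlur(image, (5, 5), 0)       # I -> I'
    edges = cv2.Canny(denoised, 50, 150)                # binarized edge map
    lines = cv2.HoughLines(edges, 1, np.pi / 180, 200)  # (rho, theta) peaks
    if lines is None:
        return denoised                                 # no dominant line found
    theta = lines[0][0][1]                              # strongest line's normal
    angle = np.degrees(theta) - 90.0                    # deviation from horizontal
    h, w = denoised.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(denoised, M, (w, h))          # I' -> I'' (deskewed)

img = np.full((200, 300), 255, dtype=np.uint8)  # stand-in for a scanned form
cv2.line(img, (10, 60), (290, 90), 0, 2)        # a slightly tilted table border
corrected = preprocess(img)
```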
OCR is a model used to convert text content in an image into computer-encoded text, for example extracting the text from a scanned or photographed image and converting it into a text file. Deep OCR models may be employed, such as Google's CRNN, Stanford's AttentionOCR, and Baidu's RARE. In the application, the OCR model is the core of image feature extraction: its role is to recognize all text content in the image, providing the basis for subsequent NLP analysis. To ensure OCR accuracy, the image undergoes preprocessing such as denoising and orientation correction. Meanwhile, since images uploaded by users may be diverse (from different scenes and devices), this embodiment adopts a pre-trained OCR model, trained on large-scale multi-source data, which gives it better generalization. Of course, to further improve targeting, the OCR model may also be fine-tuned on image data from government scenes.
Word segmentation is one of the basic tasks of natural language processing: splitting a continuous text sequence into a series of meaningful lexical units. Word segmentation algorithms differ because different languages form words differently; string-matching methods, statistical models, or deep learning methods may be used. In the present application, word segmentation is an intermediate step of image feature extraction: it refines the text content extracted by OCR from a character sequence into a word sequence, in preparation for the subsequent vectorized representation.
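As a sketch of assembling imf = [wv, pv], the example below uses the pytesseract wrapper around Tesseract (a tool choice not specified by the patent, and one that requires the Tesseract engine to be installed); the `embed` argument is a hypothetical placeholder for a trained word-vector model.

```python
# Hedged sketch: OCR words + normalized positions -> image feature vector imf.
import numpy as np
import pytesseract
from pytesseract import Output

def image_features(img, embed):
    data = pytesseract.image_to_data(img, output_type=Output.DICT)
    wv, pv = [], []
    h, w = img.shape[:2]
    for i, word in enumerate(data["text"]):
        if not word.strip():
            continue                    # skip empty OCR cells
        wv.append(embed(word))          # word vector (hypothetical model)
        cx = data["left"][i] + data["width"][i] / 2
        cy = data["top"][i] + data["height"][i] / 2
        pv.append([cx / w, cy / h])     # normalized position vector
    # imf = [wv, pv]: splice word vectors and position vectors
    return np.concatenate([np.ravel(wv), np.ravel(pv)])

# Usage with a dummy embedding (replace with a trained word-vector model):
# imf = image_features(corrected, embed=lambda word: np.zeros(50))
```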
Further, extracting audio features from the audio signal using a digital signal processing method includes: denoising the input audio signal a with spectral subtraction to obtain a denoised audio signal a′; performing endpoint detection on the audio signal a′ and removing silent segments to obtain a preprocessed audio signal a″; framing the audio signal a″ at equal intervals with a Hamming window method to obtain an audio signal frame sequence F = {f[1], f[2], ...}; extracting features of each frame f[i] with the Mel frequency cepstrum coefficient (MFCC) algorithm to obtain the audio feature vector v[i] of the corresponding frame; combining the audio feature vectors v[i] of all frames into an audio feature matrix V; calculating the Euclidean distances between the vectors in the audio feature matrix V to generate a distance matrix D, where D[i, j] = ||v[i] − v[j]||_2 is the Euclidean distance between the audio feature vectors of the i-th and j-th frames; and reducing the dimension of the distance matrix D with the multidimensional scaling (MDS) algorithm to obtain a dimension-reduced audio signal symbol graph matrix G, where each row of G gives the coordinates of one audio signal in the reduced space.
The basic idea of spectral subtraction is to subtract the estimated noise spectrum from the spectrum of noise-contaminated speech, obtaining a spectral estimate of the clean speech. In the application, a short-time Fourier transform (STFT) is applied to the noisy speech signal to obtain its spectrum. The noise is then estimated; it is typically assumed additive, stationary, and uncorrelated with speech, and its spectrum is estimated in the non-speech segments (silence or background segments). The estimated noise spectrum is subtracted from the noisy speech spectrum, giving the spectral estimate of the clean speech; the subtracted spectrum is post-processed (for example, introducing a gain factor and correcting negative values) to reduce musical noise; finally, an inverse Fourier transform yields the denoised time-domain speech signal. Spectral subtraction is the first step of audio preprocessing, eliminating the influence of ambient noise on speech quality, which helps improve subsequent feature extraction and pattern recognition.
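A minimal NumPy/SciPy sketch of this step follows; the assumption that the first 10 frames are pure noise, the flooring of negative magnitudes, and the stand-in signal are illustrative choices rather than the patent's parameters.

```python
# Hedged sketch: magnitude spectral subtraction via STFT/ISTFT.
import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(a, fs, noise_frames=10):
    f, t, Z = stft(a, fs=fs, nperseg=512)          # short-time Fourier transform
    mag, phase = np.abs(Z), np.angle(Z)
    noise = mag[:, :noise_frames].mean(axis=1, keepdims=True)  # noise spectrum
    clean = np.maximum(mag - noise, 0.0)           # subtract, correct negatives
    _, a_clean = istft(clean * np.exp(1j * phase), fs=fs, nperseg=512)
    return a_clean                                 # denoised time-domain signal

fs = 16000
a = np.random.randn(fs)            # one second of stand-in audio
a_prime = spectral_subtract(a, fs)
```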
Among these, endpoint detection is one of the preprocessing steps of speech recognition; it aims to accurately detect the starting and ending points of speech in a continuous speech stream, thereby separating speech from non-speech (silence or background noise). In the application, the short-time energy of the speech signal is computed: when the energy exceeds a threshold the frame is judged to be speech, otherwise non-speech. The zero-crossing rate of the speech signal is computed: when the rate is below a threshold the frame is judged to be speech, otherwise non-speech. Energy and zero-crossing rate are combined in a double-threshold decision rule to improve detection reliability, and the preliminary decision is smoothed to remove short-time jumps, yielding the final speech endpoints. Endpoint detection is used to remove silent segments from the audio signal, leaving only the speech segments that may contain keywords. This preprocessing greatly reduces the data volume for subsequent feature extraction and speeds up processing; at the same time, removing silence amounts to further noise reduction, which helps extract more reliable audio features.
The Hamming window is a commonly used time-domain windowing function, often applied in short-time analysis of non-stationary signals such as speech. Short-time analysis divides a long time-domain signal into a series of short segments (frames), applies a frequency-domain transform to each frame to obtain its spectrum or cepstrum, and then performs feature extraction and pattern recognition. The window function may be rectangular, triangular, Hamming, Hanning, and so on. In the application, a Hamming window is used to frame the audio signal: on one hand to facilitate extraction of short-time speech features and characterize the local frequency-change patterns of the speech; on the other hand to cooperate with the subsequent MFCC feature extraction. MFCC features characterize the spectral envelope of speech and rely on local stationarity; windowing enhances the stationarity of speech within a frame, providing a better premise for spectral analysis, while a certain overlap between frames maintains continuity and reduces information loss.
Among them, MFCC is a commonly used speech feature representation method. Pre-emphasis: the high-frequency part of a speech signal usually has small amplitude and is easily drowned by noise, so it is compensated to improve the signal-to-noise ratio; pre-emphasis can be implemented with a first-order high-pass filter H(z) = 1 − a·z⁻¹, 0.9 < a < 1, where the pre-emphasis coefficient a is usually about 0.97. Framing and windowing: the speech signal is divided into a series of short frames and each frame is multiplied by a Hamming window to enhance intra-frame stationarity. Short-time Fourier transform: a fast Fourier transform (FFT) is applied to each windowed frame to obtain its spectrum. Mel filtering: the spectrum is passed through a set of triangular band-pass filters on the mel scale to obtain the mel spectrum; the mel scale is a nonlinear frequency scale matching the auditory characteristics of the human ear, defined as m = 2595 · log10(1 + f/700), where f is the frequency in Hz and m the corresponding mel frequency. Logarithm: taking the logarithm of the mel spectrum yields the log-mel spectrum; this turns the multiplicative spectrum of the speech signal into an additive one, easing the subsequent cepstral analysis. Discrete cosine transform: a discrete cosine transform (DCT) of the log-mel spectrum gives the Mel frequency cepstrum coefficients; the DCT compresses the spectral energy onto the low-order coefficients, highlighting the most important spectral features. Dynamic features: to characterize the dynamics of the speech parameters, the first-order difference (ΔMFCC) and second-order difference (ΔΔMFCC) are usually computed as well and concatenated with the static MFCC into a complete feature vector. MFCCs provide a compact and efficient feature representation of the audio signal and are the basis for the subsequent acoustic modeling and speech recognition. Through cluster analysis and dimension-reduction mapping of the MFCC features, the distribution of different audio samples in feature space can be depicted intuitively; this symbolized representation of the audio signal helps discover the speech patterns hidden beneath the waveform.
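The pipeline above (pre-emphasis, framing/windowing, FFT, mel filtering, log, DCT, deltas) can be sketched with the third-party librosa library, which is not named in the patent; the stand-in signal and parameter values are illustrative assumptions.

```python
# Hedged sketch: MFCC + delta features; librosa handles framing/FFT/mel/log/DCT.
import numpy as np
import librosa

sr = 16000
y = np.random.randn(sr).astype(np.float32)  # one second of stand-in audio a''
y = np.append(y[0], y[1:] - 0.97 * y[:-1])  # pre-emphasis H(z) = 1 - 0.97 z^-1

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=512, hop_length=256)
d1 = librosa.feature.delta(mfcc)            # delta-MFCC (first-order difference)
d2 = librosa.feature.delta(mfcc, order=2)   # delta-delta-MFCC (second order)

V = np.vstack([mfcc, d1, d2]).T             # one 39-dim feature vector per frame
print(V.shape)                              # (num_frames, 39): feature matrix V
```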
Among these, multidimensional scaling (MDS) is a classical method for data reduction and visualization. Its purpose is to map high-dimensional data into a low-dimensional space (typically two- or three-dimensional) while maintaining the similarity or dissimilarity relationships between samples, revealing the inherent structure and correlation patterns of the data. In the present application, the input to the MDS algorithm is a distance matrix D, where D[i, j] is the dissimilarity between the i-th and j-th samples and may be any measure such as Euclidean or Manhattan distance. The goal of MDS is to find a set of low-dimensional coordinates X such that the Euclidean distances of the samples in the low-dimensional space are as close as possible to the dissimilarities in the original space; mathematically, the stress function Stress(X) = Σ (dist(x_i, x_j) − D[i, j])² is minimized, where dist(·) denotes Euclidean distance in the low-dimensional space. To solve this optimization problem, MDS algorithms generally adopt iterative strategies and fall into two major classes, metric MDS and non-metric MDS. Metric MDS: when the distance matrix D satisfies all properties of a Euclidean distance (non-negativity, symmetry, the triangle inequality, etc.), it can be solved directly by eigenvalue decomposition. The concrete steps are: convert the distance matrix D into an inner-product matrix B; perform an eigenvalue decomposition of B to obtain eigenvalues and eigenvectors; take the eigenvectors corresponding to the d largest eigenvalues as the coordinates of the samples in d-dimensional space, where d may be chosen according to the eigenvalue magnitudes and their cumulative contribution rate. The solution obtained by metric MDS is globally optimal and unique. Non-metric MDS: when D does not satisfy the Euclidean distance properties, non-metric MDS is used. It does not require the elements of D to satisfy any metric property, only that their order relation be consistent with the dissimilarity order of the samples; non-metric MDS is thus actually an optimization over an ordinal scale. Common solvers are the Shepard-Kruskal algorithm and the SMACOF algorithm, with the basic flow: randomly initialize the low-dimensional coordinates X; compute the Euclidean distance matrix dist(X) of the samples in the low-dimensional space from X; adjust the elements of dist(X) to be order-consistent with D, obtaining d̂; minimize the stress function Stress(X) = Σ (dist(X) − d̂)², updating X; repeat until Stress(X) converges or the maximum number of iterations is reached. The solution obtained by non-metric MDS is usually only locally optimal and is strongly affected by the initial value. Here, MDS is used to learn a two-dimensional audio signal symbol graph from the inter-frame distance matrix of the MFCC features: each audio frame is represented by a point on the symbol graph, and the geometric distance between points corresponds to the acoustic dissimilarity between frames.
Intuitively, points that cluster together on the symbol graph belong to acoustically similar frames, likely from the same phoneme or syllable, whereas scattered points belong to frames with larger acoustic differences, possibly from different speech segments. Thus, by analyzing the shape and distribution of the symbol graph, one can gain insight into the internal structure of the audio sequence and the occurrence patterns of speech events or keywords.
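A scikit-learn sketch of this embedding step follows (library choice assumed, not specified by the patent); the random feature matrix stands in for real frame-level MFCC vectors.

```python
# Hedged sketch: frame distance matrix D -> 2-D symbol graph matrix G via MDS.
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

V = np.random.rand(40, 39)                     # 40 frames x 39 MFCC features
D = pairwise_distances(V, metric="euclidean")  # D[i, j] = ||v[i] - v[j]||_2

# dissimilarity="precomputed" tells MDS to treat D itself as the distances
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
G = mds.fit_transform(D)                       # each row: frame coords in 2-D
print(G.shape)                                 # (40, 2): symbol graph matrix G
```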
Further, extracting audio features from the audio signal using a digital signal processing method further includes: taking each row of the matrix G as a point in the low-dimensional space, each point corresponding to one audio signal frame f[i]; constructing a topological graph T whose nodes are the audio signal frames f[i]; calculating the Euclidean distance d[i, j] between any two points in the low-dimensional space and using it as the weight of the edge between the corresponding frame nodes in the topological graph T; searching the connected subgraphs of the topological graph T with a depth-first search algorithm to obtain a connected subgraph set C = {c[1], c[2], ...}; extracting a feature vector s[k] from each connected subgraph c[k] to obtain a voiceprint feature vector set S = {s[1], s[2], ...}; clustering the voiceprint feature vector set S with the K-means clustering algorithm to obtain a clustering result K = {k[1], k[2], ...}; calculating the center vector c[i] of each class k[i] as a candidate voiceprint feature vector; selecting, among the candidate voiceprint feature vectors, the vector with the largest inter-class distance as the final voiceprint feature vector v; and taking the voiceprint feature vector v as the audio feature af extracted from the input audio signal a.
In the present application, the low-dimensional space refers to the two- or three-dimensional Euclidean space mapped from the high-dimensional MFCC feature space by the MDS algorithm. Each audio signal frame is represented by a high-dimensional vector in the MFCC feature space and corresponds to a two- or three-dimensional coordinate point in the low-dimensional space. A topological graph is a commonly used data structure for representing adjacency and connectivity relationships between samples. In the application, the topological graph T takes the audio signal frames as nodes and the acoustic similarity between frames as edges, describing the evolution of the audio stream in time and in acoustic space. Each node of T corresponds to a low-dimensional coordinate point representing an audio signal frame; an edge between nodes carries the Euclidean distance of the two frames in the low-dimensional space, reflecting their acoustic dissimilarity. The larger an edge's weight, the more pronounced the difference between the two frames; conversely, the smaller the weight, the closer the two frames.
Depth-first search (DFS) is a commonly used graph traversal algorithm. Its basic idea is to start from an initial node of the graph and keep exploring forward along one path until it cannot continue, then backtrack to the nearest branching point and continue along another path, repeating until the whole graph has been traversed. In the present application, the DFS algorithm maintains a stack of nodes pending exploration and uses a marker array (visited) to record whether each node has been visited. One node at a time is taken from the top of the stack and marked as visited, and all of its unvisited adjacent nodes are pushed onto the stack; when the stack is empty, the search ends. First, an unvisited node is chosen arbitrarily as the starting point and the DFS routine is called to find all nodes reachable from it, which form one connected subgraph. Then another unvisited node is chosen and the process repeats until all nodes have been visited. Finally, a partition of the topological graph T into connected subgraphs is obtained.
In graph theory, the connected subgraph refers to a subgraph H of the undirected graph G, where a path exists between any two nodes of H. In other words, a connected subgraph is an internal "connected together" portion of the original graph, whose node set is a subset of the original graph node set, and whose edge set contains all the edges of the original graph that connect the nodes. In the present application, the connected subgraph on the topological graph T represents a group of frames with similar acoustic characteristics and similar time positions in the audio signal. Because the speech signal is composed of a sequence of speech events (e.g., phonemes, words, sentences, etc.) that are sequentially combined, it tends to appear as a "chain" of mutually disjoint ones on a topological graph. These chains correspond to different structural components of the speech signal. Through the DFS algorithm, all connected subgraphs in the topological graph T can be automatically found, and a preliminary structure segmentation of the voice signal is obtained.
K-means is a classical unsupervised learning algorithm for dividing a sample set into a pre-specified number (k) of clusters. In the application, k samples are randomly selected as initial centroids; for each sample, its distance to each centroid is computed and it is assigned to the nearest cluster; for each cluster, the centroid (the mean vector of all samples within the cluster) is recomputed; the last two steps are repeated until the centroids no longer change or the maximum number of iterations is reached. Through k-means clustering, similar voiceprint feature vectors are automatically grouped, forming a voiceprint-level segmentation of the audio signal. Here, k-means clusters the voiceprint feature vectors of the connected subgraphs to find the dominant speakers or voiceprint patterns in the audio signal; the samples are fixed-dimension feature vectors extracted from each connected subgraph, characterizing the acoustic statistics of all frames within the subgraph.
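The sketch below combines the DFS and k-means steps: edges are kept below an assumed distance threshold, connected subgraphs are found by iterative DFS, and each subgraph contributes one feature vector (here, simply the mean of its points, an illustrative choice) to the clustering.

```python
# Hedged sketch: threshold graph -> DFS connected subgraphs -> k-means centers.
import numpy as np
from sklearn.cluster import KMeans

def connected_subgraphs(G, eps=0.2):
    n = len(G)
    dist = np.linalg.norm(G[:, None] - G[None, :], axis=-1)  # d[i, j]
    adj = [np.nonzero((dist[i] < eps) & (np.arange(n) != i))[0]
           for i in range(n)]
    visited, comps = [False] * n, []
    for start in range(n):
        if visited[start]:
            continue
        stack, comp = [start], []       # iterative depth-first search
        while stack:
            u = stack.pop()
            if visited[u]:
                continue
            visited[u] = True
            comp.append(u)
            stack.extend(v for v in adj[u] if not visited[v])
        comps.append(comp)
    return comps

G = np.random.rand(40, 2)               # MDS coordinates of 40 frames
comps = connected_subgraphs(G)
# One fixed-dimension voiceprint vector per subgraph: mean of its frame points
S = np.array([G[c].mean(axis=0) for c in comps])
k = min(3, len(S))                      # assumed number of voiceprint classes
centers = KMeans(n_clusters=k, n_init=10).fit(S).cluster_centers_
```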
Further, outputting a government affair item recommendation result includes: obtaining the text feature vector tf, the user behavior feature vector uf, the image feature vector imf and the audio feature vector af; calculating the L2 norm ||tf||_2 of the text feature vector tf; obtaining the normalized text feature vector tf′ = tf / ||tf||_2 from the norm ||tf||_2; calculating the minimum min_uf and maximum max_uf of the elements of the user behavior feature vector uf; obtaining the normalized user behavior feature vector uf′ = (uf − min_uf) / (max_uf − min_uf) from the minimum min_uf and maximum max_uf; calculating the mean mean_imf and standard deviation std_imf of the elements of the image feature vector imf; obtaining the normalized image feature vector imf′ = (imf − mean_imf) / std_imf from the mean mean_imf and standard deviation std_imf; calculating the Euclidean norm ||af||_E of the audio feature vector af; and obtaining the normalized audio feature vector af′ = af / ||af||_E from the norm ||af||_E.
The L2 norm, also known as the Euclidean norm or second-order norm, measures the length or size of a vector. For an n-dimensional real vector x = (x_1, x_2, ..., x_n), its L2 norm is defined as ||x||_2 = sqrt(x_1² + x_2² + ... + x_n²). In the present application, the L2 norm is used to normalize the text feature vector tf. The purpose of normalization is to unify feature values of different dimensions onto a similar scale for subsequent feature fusion and similarity calculation. Dividing by the L2 norm turns tf into a unit vector: its L2 norm becomes 1 while its direction remains unchanged.
The Euclidean norm, also called the L2 norm, is conceptually identical to the L2 norm mentioned earlier, differing only in notation. It measures the length or size of a vector in Euclidean space. In the present application, the Euclidean norm is used to normalize the audio feature vector af, in the same manner as the text feature vector: dividing by the Euclidean norm turns af into a unit vector, eliminating the scale differences between different audio signals so that they can be compared and fused in a unified metric space.
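A compact sketch of the four normalization strategies (L2 for text, min-max for behavior, z-score for image, Euclidean norm for audio) follows; the toy vectors are illustrative assumptions.

```python
# Hedged sketch: the four per-modality normalization strategies.
import numpy as np

def normalize_views(tf, uf, imf, af):
    tf_n = tf / np.linalg.norm(tf)                  # tf' = tf / ||tf||_2
    uf_n = (uf - uf.min()) / (uf.max() - uf.min())  # min-max scaling
    imf_n = (imf - imf.mean()) / imf.std()          # z-score standardization
    af_n = af / np.linalg.norm(af)                  # af' = af / ||af||_E
    return tf_n, uf_n, imf_n, af_n

tf, uf = np.array([3.0, 4.0]), np.array([1.0, 5.0, 3.0])
imf, af = np.array([2.0, 4.0, 6.0]), np.array([1.0, 2.0, 2.0])
print(normalize_views(tf, uf, imf, af))  # e.g. tf' = [0.6, 0.8]
```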
Further, a multi-view feature matrix MF is constructed from the normalized text feature vector tf′, user behavior feature vector uf′, image feature vector imf′ and audio feature vector af′: MF = [tf′, uf′, imf′, af′]. An objective function L(W) = ||MF − MF·W||_F² + λ||W||_1 is constructed, where W is the weight matrix and λ is a regularization parameter. The objective function L(W) is solved with an alternating optimization algorithm to obtain the optimal weight matrix W, from which the weight vectors wtf, wuf, wimf and waf of the different feature views are obtained. Here ||·||_F denotes the Frobenius norm, defined for a matrix A as the square root of the sum of the squares of its elements, i.e., ||A||_F = sqrt(Σ_(i,j) a_ij²), where a_ij is the element in row i and column j of A and Σ_(i,j) denotes summation over all elements of the matrix, accumulating the square of each element over all rows and columns. ||·||_1 denotes the L1 norm, defined for a matrix A as the sum of the absolute values of its elements, i.e., ||A||_1 = Σ_(i,j) |a_ij|. The term ||MF − MF·W||_F² measures, with the Frobenius norm, the difference between the multi-view feature matrix MF and its reconstruction MF·W; the aim is to learn a weight matrix W such that the reconstructed matrix MF·W approximates the original matrix MF as closely as possible. The term λ||W||_1 is an L1-norm regularizer promoting sparsity of the weight matrix W: the L1 norm helps compress some elements of the weight matrix to 0, achieving a feature-selection effect. The regularization parameter λ controls the strength of the sparsity; the larger λ, the sparser the weight matrix W. By minimizing the objective function L(W), an optimal weight matrix W is learned that minimizes the reconstruction error while keeping the weight matrix sufficiently sparse; W is then used to weight and fuse the features of the different views into the final fused feature vector.
The text feature vector tf′, user behavior feature vector uf′, image feature vector imf′ and audio feature vector af′ are each weighted and fused with their corresponding weight vectors: the weighted text feature vector tfweighted = tf′ × wtf; the weighted user behavior feature vector ufweighted = uf′ × wuf; the weighted image feature vector imfweighted = imf′ × wimf; the weighted audio feature vector afweighted = af′ × waf. The weighted feature vectors are spliced into the final fused feature vector vfused = [tfweighted; ufweighted; imfweighted; afweighted], and vfused is input into a pre-trained government affair recommendation model to obtain the government affair recommendation result.
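The fusion step itself reduces to element-wise weighting followed by concatenation, as this sketch shows; the toy dimensions and weight values are illustrative assumptions.

```python
# Hedged sketch: weighted per-view fusion, then splicing into vfused.
import numpy as np

def fuse(views, weights):
    # Element-wise weighting per view, then splicing into one fused vector
    return np.concatenate([v * w for v, w in zip(views, weights)])

tf_n, uf_n = np.array([0.6, 0.8]), np.array([0.0, 1.0, 0.5])
imf_n, af_n = np.array([-1.0, 0.0, 1.0]), np.array([0.33, 0.67, 0.67])
wtf, wuf = np.array([0.9, 0.9]), np.array([0.4, 0.4, 0.4])
wimf, waf = np.array([0.7, 0.7, 0.7]), np.array([0.5, 0.5, 0.5])

vfused = fuse([tf_n, uf_n, imf_n, af_n], [wtf, wuf, wimf, waf])
print(vfused.shape)  # (11,): input to the recommendation network
```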
Furthermore, the alternating optimization adopts the proximal gradient descent algorithm PGD, iteratively optimizing by constructing a first-order Taylor expansion around the weight matrix W and applying gradient descent. Alternating optimization is a common strategy for solving complex optimization problems, especially suited to objective functions containing several variable groups or blocks. The basic idea is to decompose the original problem into several subproblems, each involving only one variable group while the other groups are held fixed; by repeatedly and alternately solving the subproblems, each variable group is continuously updated and optimized until a global optimum is reached or a convergence condition is satisfied. Specifically: initialize all variable groups x_1, x_2, ..., x_n; fix x_2, x_3, ..., x_n and optimize x_1 to obtain x_1*; fix x_1*, x_3, ..., x_n and optimize x_2 to obtain x_2*; continue in this manner until fixing x_1*, x_2*, ..., x_(n-1)* and optimizing x_n yields x_n*; repeat until convergence or the maximum number of iterations is reached.
The proximal gradient descent algorithm PGD is a method for solving optimization problems with a non-smooth regularization term, such as Lasso regression and sparse coding. The objective function is decomposed into two parts, a smooth loss function and a non-smooth regularization function; the variables are then updated iteratively by combining a first-order Taylor expansion of the loss with the proximal operator of the regularization function until convergence. In the application, the weight matrix W is initialized as a random matrix, and the number of iterations T and the learning rate η are set. For t = 1, 2, …, T the following steps are repeated: compute the gradient ∇L(W_t) of the loss function with respect to W; construct the first-order Taylor expansion around W_t, L(W) ≈ L(W_t) + <∇L(W_t), W - W_t> + (1/(2η))||W - W_t||_F^2; solve the proximal problem W_{t+1} = argmin_W λ||W||_1 + (1/(2η))||W - (W_t - η∇L(W_t))||_F^2; and update the weight matrix accordingly. The final weight matrix W_T is output. Here <X, Y> denotes the matrix inner product and ||X||_F the Frobenius norm of the matrix.
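Under this formulation the proximal problem has a closed-form soft-thresholding solution. A minimal numpy sketch (the objective L(W) = ||MF - MF×W||_F^2 + λ||W||_1 and the default step size are assumptions drawn from the description above):

```python
import numpy as np

def soft_threshold(X, tau):
    """Proximal operator of tau * ||.||_1: shrink every element toward zero."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def pgd(MF, lam=0.1, eta=None, T=500, seed=0):
    """Proximal gradient descent for L(W) = ||MF - MF W||_F^2 + lam ||W||_1."""
    d = MF.shape[1]
    G = MF.T @ MF                                  # Gram matrix, reused each step
    if eta is None:
        eta = 1.0 / (2 * np.linalg.norm(G, 2))     # 1/L with L = 2 * sigma_max(G)
    W = np.random.default_rng(seed).normal(scale=0.01, size=(d, d))
    for _ in range(T):
        grad = 2 * (G @ W - G)                     # gradient of the smooth loss
        W = soft_threshold(W - eta * grad, eta * lam)
    return W
```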
3. Advantageous effects
Compared with the prior art, the application has the advantages that:
Multi-modal data are used together to characterize the user's intention comprehensively. By collecting multi-source heterogeneous data comprising user text, behavior logs, images and audio, the user's latent needs and preferences are mined from different dimensions.
To address the heterogeneity and redundancy of multi-modal data, a multi-view representation learning algorithm jointly optimizes the different feature types. A multi-view feature matrix is constructed, a regularized objective function is designed, and the weight coefficient of each view is learned adaptively with an alternating optimization strategy.
When extracting text features, a CRF named entity recognition model is built with a government-domain dictionary, effectively recognizing key entity information such as departments and service matters. Meanwhile, a TF-IDF algorithm is introduced to compute vocabulary weights, highlighting the importance of government-related keywords. When processing user behavior data, factors such as material submission and aborted transactions are considered together to dynamically adjust the inferred user intention, which better matches real government service scenarios.
When processing complex data such as images and audio, a hierarchical feature extraction strategy is adopted, from coarse to fine and from low level to high level. For example, an image is first preprocessed and character-recognized, and then word vectors and position vectors are extracted; audio is first denoised and framed, and then MFCC coefficients and voiceprint features are extracted. This staged, multi-granularity processing reduces data redundancy, lowers computational complexity, and improves the efficiency and accuracy of feature extraction.
The fused multi-modal feature representation is input into a neural network model, and government matters are recommended end to end with deep learning. The deep neural network can learn complex nonlinear relations among features and mine deep semantic associations between user intention and matters. With its strong representation and self-learning capability, the model can continuously optimize the recommendation strategy from massive user profile data, achieving personalized and accurate matching of matters.
When extracting audio features, a low-dimensional embedding of the audio signal is obtained through MDS dimension reduction, and a topological graph is built on the embedding to mine the relations among different signal frames. Connected subgraphs are extracted from the topological graph and clustered, and the voiceprint features are selected from the cluster with the largest inter-class difference.
When fusing the different view features, an alternating optimization strategy solves for the weight matrix and the fused features separately, decomposing the original problem into several single-variable optimization sub-problems and reducing the solving difficulty. Meanwhile, the proximal gradient descent algorithm PGD is introduced to iteratively optimize the weight matrix; the descent direction is constructed through a first-order Taylor expansion, achieving an efficient approximation in the optimization process.
Drawings
The present specification is further described by way of exemplary embodiments, which are illustrated in detail in the accompanying drawings. The embodiments are not limiting; like numerals represent like structures, wherein:
FIG. 1 is an exemplary flow chart of a government event dialogue recommendation method based on multidimensional vector fusion according to some embodiments of the present description;
FIG. 2 is an exemplary flow chart for capturing text features according to some embodiments of the present description;
FIG. 3 is an exemplary flow chart for capturing image features according to some embodiments of the present description;
FIG. 4 is an exemplary flow chart for acquiring audio features according to some embodiments of the present description;
FIG. 5 is an exemplary flow chart for obtaining user behavior characteristics according to some embodiments of the present description.
Detailed Description
The method and system provided in the embodiments of the present specification are described in detail below with reference to the accompanying drawings.
FIG. 1 is an exemplary flow chart of a government affair dialogue recommendation method based on multidimensional vector fusion according to some embodiments of the present disclosure, in which there is a government affair dialogue recommendation system. When a user submits a government affair query, the system automatically recommends the government matters most likely to meet the user's needs based on the multi-modal information the user provides, such as text, images and audio. First, the system receives input data from the user side: a query may include a text description, an uploaded picture, a voice input, and so on. The system stores the text data, image data and audio data in the corresponding data buffers, and also reads the user's historical behavior log data from a background database. Consider a user who performs a query in the government affair dialogue recommendation system, wanting to know the process for registering a company.
First, the user types into the system's text input box: "I want to register a company; what procedures and materials are needed?" After the send button is clicked, the system receives the text query and stores it in the text data buffer for subsequent processing. The user then clicks the picture upload button, selects a business license photo taken earlier, and uploads it to the system; after receiving the picture data, the system stores it in the image data buffer for further analysis and extraction. Finally, the user clicks the voice input button and says in Mandarin: "The boss wants me to register the company as soon as possible; can it be done within a week?" The system converts the audio into text through the speech recognition module and stores it in the audio data buffer.
At this point the system has collected one complete query from the user: text data ("I want to register a company; what procedures and materials are needed?"), image data (a business license photo), and audio data ("The boss wants me to register the company as soon as possible; can it be done within a week?"). Meanwhile, the system reads the user's earlier operation records from the background behavior log database: two months ago the user browsed the article "How to register a company", staying for 2 minutes; one month ago the user searched "company registration process" in the system and clicked the top 3 search results; one week ago the user downloaded the "business start-up application form" in the system. All this multi-source heterogeneous data, including the query text, image and audio plus the historical behavior log, is saved by the system for subsequent feature extraction, semantic analysis and intention recognition, so as to accurately profile the user and deeply understand the user's real needs. This is the data foundation for intelligent, personalized government affair dialogue recommendation. After obtaining the raw data, the system performs fusion analysis of the multi-modal data: semantic features are first extracted from the text, image and audio separately, then fused through the multi-view feature weighting method into a unified representation of user intention. Finally, combined with the user profile, the deep learning model outputs the final government matter recommendation, for example matters related to enterprise registration, business registration and seal engraving.
FIG. 2 is an exemplary flow chart for capturing text features according to some embodiments of the present description. The system first performs word segmentation and part-of-speech tagging on the text data. A pre-trained CRF-based named entity recognition model extracts key information such as department entities and matter entities. Next, a TF-IDF-based keyword extraction model obtains the keyword features of the text. Finally, the named entity features and keyword features are concatenated into the text feature vector. First, the system segments the text into the word sequence: ["I", "want", "register", "one", "company", ",", "ask", "need", "which", "process", "and", "material", "?"]. Then the system tags the parts of speech of the word sequence, obtaining the part-of-speech sequence: [r (pronoun), v (verb), m (numeral), n (noun), w (punctuation), v (verb), r (pronoun), n (noun), c (conjunction), n (noun), w (punctuation)]. The system then performs entity extraction on the text with the pre-built conditional random field (CRF) named entity recognition model. The CRF model takes the part-of-speech tagging sequence as the observation sequence and uses words and phrases from the government-domain dictionary as input features to construct feature templates. The CRF model is trained on an annotated government text data set, learning the weight parameters of the different feature functions.
In this example, the system inputs the word sequence ["I", "want", "register", "one", "company", ",", "ask", "need", "which", "process", "and", "material", "?"] and the corresponding part-of-speech sequence into the trained CRF model and performs sequence labeling with the Viterbi algorithm, obtaining the named entity recognition result: ["I"/O, "want"/O, "register"/O, "one"/O, "company"/ORG, ","/O, "ask"/O, "need"/O, "which"/O, "process"/O, "and"/O, "material"/O, "?"/O], in which "company", labeled "ORG", is recognized as an organization entity. The system takes such named entities as the first class of text features. Next, the system extracts keyword features from the text with the TF-IDF-based keyword extraction model. First, the system computes, over a government-domain text corpus, the term frequency TF and inverse document frequency IDF of each candidate word, takes the TF-IDF value as the word's weight, and builds the TF-IDF model. The model is then applied to the user's input word sequence to compute a weight value for each word. A weight threshold (e.g., 0.1) is set, and words whose weight exceeds the threshold are selected as keyword features; in this example, keywords such as ["register", "company", "process", "material"] may be extracted as the second class of text features. Finally, the system concatenates the first-class named entity features and the second-class keyword features into the text feature vector, e.g. tf = [0,0,1,0,1,0,0,0,0,1,0,1,0], where 1 indicates that the corresponding word is a named entity or keyword and 0 indicates that it is not. This yields a compact, semantically rich text feature representation integrating named entity information and keyword information.
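A minimal sketch of the TF-IDF keyword-selection step (assuming scikit-learn; the corpus is illustrative and the 0.1 threshold follows the example above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative government-domain corpus; real training would use domain documents.
corpus = [
    "register company process material business license",
    "tax declaration process material",
    "property certificate application material",
]
query_tokens = ["I", "want", "register", "one", "company", "ask",
                "need", "which", "process", "and", "material"]

vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)
weights = vectorizer.transform([" ".join(query_tokens)]).toarray()[0]

threshold = 0.1
vocab = vectorizer.get_feature_names_out()
keywords = [w for w, s in zip(vocab, weights) if s > threshold]
print(keywords)  # ['company', 'material', 'process', 'register']
```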
FIG. 3 is an exemplary flow chart for capturing image features according to some embodiments of the present description. For image data, the system first performs denoising and orientation-correction preprocessing, then applies a deep-learning-based OCR model to recognize characters and extract the text information in the image. The extracted text content is word-segmented to obtain the word sequence and the corresponding position information, which are then converted into vector representations and concatenated into the image feature vector. First, the system preprocesses the picture. Gaussian filtering denoises the image to reduce noise interference: specifically, a 5×5 Gaussian kernel with standard deviation 1.5 is convolved with the original image I to obtain the denoised image I'. Next, the system corrects the orientation of image I'. Because the user may have tilted the camera when shooting, the image needs rotation adjustment: the system detects straight lines with the Hough transform, computes a corrective rotation matrix from their tilt angles, and applies an affine transformation to the image, yielding the orientation-corrected image I''. The system then feeds the preprocessed image I'' into a pre-trained optical character recognition (OCR) model. The OCR model uses an end-to-end structure based on convolutional and recurrent neural networks and can locate and recognize characters simultaneously. The OCR model used by the system was trained on a large volume of certificate data such as business licenses, so it can accurately extract the text regions and contents of such images.
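A sketch of this preprocessing pipeline (assuming OpenCV; the 5×5 kernel and σ = 1.5 follow the description, while the Canny and Hough parameters and the median-angle heuristic are illustrative):

```python
import cv2
import numpy as np

def preprocess(image_path: str) -> np.ndarray:
    I = cv2.imread(image_path)
    # Denoise: 5x5 Gaussian kernel with standard deviation 1.5.
    I1 = cv2.GaussianBlur(I, (5, 5), 1.5)

    # Estimate the dominant tilt angle from Hough lines.
    edges = cv2.Canny(cv2.cvtColor(I1, cv2.COLOR_BGR2GRAY), 50, 150)
    lines = cv2.HoughLines(edges, 1, np.pi / 180, 200)
    angle = 0.0
    if lines is not None:
        thetas = [theta for rho, theta in lines[:, 0]]
        angle = np.degrees(np.median(thetas)) - 90.0  # deviation from upright

    # Affine rotation about the image center corrects the orientation.
    h, w = I1.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(I1, M, (w, h))
```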
Through the OCR model, the system recognizes the text content of the image I'', obtaining for example: texts = "XX City XX District XX Company, unified social credit code: 91330106MA2CFXXX9P, type: limited liability company (natural-person investment or holding), legal representative: Zhang San …". Next, the system word-segments the extracted text content texts with the jieba segmentation tool, obtaining the word sequence: words = ["XX City", "XX District", "XX", "Company", "unified", "social", "credit code", ":", "91330106MA2CFXXX9P", "type", ":", "limited liability company", "(", "natural person", "investment", "or", "holding", ")", "legal representative", ":", "Zhang San", "…"]. At the same time, the system records the position of each word in the original text, giving the position sequence positions = [(0,3), (3,6), (6,9), (9,11), (11,13), (13,15), (15,20), (20,21), (21,37), (37,39), (39,40), (40,46), (46,47), (47,50), (50,52), (52,53), (53,55), (55,56), (56,61), (61,62), (62,64), (64,65)], where (a, b) means the word spans the a-th to b-th characters of the original text. Next, the system converts the word sequence words into a word vector sequence wv: for each word, its vector representation in a pre-trained word vector space is looked up, yielding a real-valued vector of fixed dimension (e.g., 300). The words sequence is thus mapped to a word vector sequence such as wv = [[0.1, 0.2, …, -0.1], [0.2, -0.3, …, 0.5], …, [0.4, -0.1, …, -0.2]]. Similarly, the system converts the position sequence positions into a position vector sequence pv: to encode the position information as a fixed-length real vector, the positions are normalized and mapped into a high-dimensional space through sine and cosine transformations. With a 50-dimensional position vector, the positions sequence becomes, for example, pv = [[0.2, -0.4, …, 0.1], [-0.3, 0.1, …, -0.4], …, [0.5, 0.2, …, -0.1]]. Finally, the system concatenates the word vector sequence wv and the position vector sequence pv into the complete image feature vector imf = [wv, pv] = [[0.1, 0.2, …, -0.1, 0.2, -0.4, …, 0.1], [0.2, -0.3, …, 0.5, -0.3, 0.1, …, -0.4], …, [0.4, -0.1, …, -0.2, 0.5, 0.2, …, -0.1]], where imf is a vector sequence whose length equals that of the words sequence, each vector being the concatenation of a word vector (300 dimensions) and a position vector (50 dimensions), 350 dimensions in total. The semantic information and position information of every piece of text in the image are thus jointly encoded.
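A sketch of the sine-cosine position encoding described above (numpy; using the span start as the position index and a transformer-style frequency schedule is one assumed reading of "normalize, then map through sine and cosine transformations"):

```python
import numpy as np

def position_vectors(positions, dim=50, max_len=1000):
    """Map (start, end) character spans to fixed-length sinusoidal vectors."""
    pv = np.zeros((len(positions), dim))
    for i, (start, _end) in enumerate(positions):
        pos = start / max_len                        # normalize position to [0, 1]
        freqs = np.arange(dim // 2)
        angles = pos / (10000 ** (2 * freqs / dim))  # transformer-style frequencies
        pv[i, 0::2] = np.sin(angles)
        pv[i, 1::2] = np.cos(angles)
    return pv

positions = [(0, 3), (3, 6), (6, 9)]
print(position_vectors(positions).shape)  # (3, 50)
```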
FIG. 4 is an exemplary flow chart for acquiring audio features according to some embodiments of the present description. For audio data, the system first performs noise reduction and endpoint detection preprocessing and converts the signal into a frame sequence. The MFCC algorithm then extracts mel-frequency cepstral coefficients for each frame. From the MFCC features, a similarity matrix is built with Euclidean distances and the high-dimensional features are mapped to a low-dimensional space with the MDS algorithm; the voiceprint features in the audio are then extracted with a graph-theoretic procedure. Finally, the MFCC features and voiceprint features are concatenated into the audio feature vector. First, the system preprocesses the audio: spectral subtraction denoises the original audio signal a, removing background noise interference and yielding the denoised signal a'. The system then performs endpoint detection on a', removing the silent intervals before and after the speech through dual-threshold comparison and energy thresholding, giving a signal a'' that contains only valid speech. Next, the system frames the preprocessed audio signal a'': using a Hamming window with a 20 ms frame length and a 10 ms frame shift, a'' is split at equal intervals into the audio frames f[1], f[2], …, f[n], where n is the total number of frames; here 200 frames are obtained. The system then extracts audio features for each frame, adopting MFCC (mel-frequency cepstral coefficients) as the representation: for each frame f[i], the system first computes its short-time energy spectrum, filters it with a mel filter bank, takes the logarithm, and finally applies a DCT, obtaining a 13-dimensional MFCC feature vector v[i]. The original speech signal is thus converted into a series of 13-dimensional MFCC feature vectors.
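A sketch of the framing and MFCC extraction step (assuming librosa; the 20 ms frame length, 10 ms shift, Hamming window and 13 coefficients follow the description, while the file name and sample rate are illustrative):

```python
import librosa

y, sr = librosa.load("query.wav", sr=16000)  # illustrative file and sample rate
frame_len = int(0.020 * sr)                  # 20 ms frame length
hop_len = int(0.010 * sr)                    # 10 ms frame shift

# 13 mel-frequency cepstral coefficients per Hamming-windowed frame.
V = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                         n_fft=frame_len, hop_length=hop_len,
                         window="hamming")
print(V.shape)  # (13, n_frames), e.g. (13, 200)
```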
The MFCC feature vectors of the frames are combined in order into a 13×200 audio feature matrix V, each column being the feature vector of one frame. To explore the relations between different frames, the system computes the Euclidean distances between the column vectors of V, producing a 200×200 distance matrix D. D[i, j] measures the similarity of the i-th and j-th frames in feature space; a smaller distance means the two frames are closer. This raw distance matrix is high-dimensional and inconvenient to analyze, so the system reduces it with the MDS (multidimensional scaling) algorithm: MDS maps the 200-dimensional representation to a 2-dimensional plane while preserving the inter-frame distance relations, giving the reduced audio signal map G, each row of which is the coordinate of one audio frame in two dimensions. From this two-dimensional map G, the system further extracts the voiceprint features of the audio. First, each point of G is taken as a node, and nodes whose distance is below a threshold are connected by edges, building the topological graph T, which reflects the connectivity structure among audio frames. The system then searches T for connected subgraphs with depth-first search, obtaining a number of connected segments c[1], c[2], …; each segment is a set of topologically tightly connected audio frames. For each connected segment c[i], the system computes the mean of its 13-dimensional MFCC feature vectors as the segment's feature description s[i]; the feature vectors of all segments form the voiceprint feature set S. To select the most representative voiceprint features from S, the system clusters S with the k-means algorithm, with the number of classes set to 3. After clustering, the center vector of each class serves as a candidate voiceprint feature vector; the system computes the inter-class distances among the 3 candidate feature vectors and selects the one with the largest distance as the final voiceprint feature v. In summary, the system extracts two types of audio features from the user's voice input: the MFCC feature matrix V and the voiceprint feature vector v. Finally, the system concatenates each frame's 13-dimensional MFCC features with the voiceprint features, obtaining an audio feature sequence af of length 200 whose elements are 14-dimensional vectors. This af is the structured feature representation extracted from the input audio.
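A sketch of the distance-matrix reduction and graph-based voiceprint selection (assuming scikit-learn, scipy and numpy; the stand-in MFCC data and the edge threshold are illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import cdist
from sklearn.manifold import MDS
from sklearn.cluster import KMeans

V = np.random.rand(13, 200)              # stand-in MFCC matrix (13 x n_frames)
D = cdist(V.T, V.T)                      # 200 x 200 inter-frame distance matrix

# MDS maps the precomputed distances to a 2-D embedding G.
G = MDS(n_components=2, dissimilarity="precomputed",
        random_state=0).fit_transform(D)

# Topological graph T: connect embedded points closer than a threshold.
adj = cdist(G, G) < 0.1                  # illustrative threshold
np.fill_diagonal(adj, False)
_, labels = connected_components(csr_matrix(adj), directed=False)

# One mean-MFCC descriptor per connected segment.
S = np.array([V[:, labels == seg].mean(axis=1) for seg in np.unique(labels)])

# Cluster the segment descriptors (3 classes when enough segments exist).
km = KMeans(n_clusters=min(3, len(S)), n_init=10, random_state=0).fit(S)
centers = km.cluster_centers_
# Pick the center farthest from the others as the voiceprint feature v.
v = centers[np.argmax(cdist(centers, centers).sum(axis=1))]
```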
FIG. 5 is an exemplary flow chart for obtaining user behavior features according to some embodiments of the present disclosure. For user behavior data, the system analyzes the user's history of transacted matters, including the amounts of submitted materials, the proportion of each material type, and the number of aborted transactions. It then builds an initial intention ordering of government matters and adjusts the ordering according to the abort count, forming a behavior feature sequence that reflects the user's actual intention. First, the system statistically analyzes the user's behavior log. Through log analysis, the system finds that the user transacted 10 matters this year, such as company registration, property certification and tax declaration. For each matter, the system further obtains the materials the user submitted, covering 8 material types including identity card, household register and business license, which yields a 10×8 count matrix for the 10 matters: each row is a matter, each column a material type, and each element the corresponding number of submissions. Next, the system totals the submissions of each material type. For example, across the 10 matters the user submitted 8 identity cards, 5 household registers and 7 business licenses, and so on; summing gives the total submission count sum over all materials. With the per-type counts and the total, the system can compute each material's submission proportion: for the identity card, p1 = 8/sum; likewise, the household register has p2 = 5/sum and the business license p3 = 7/sum, and so on. The raw material counts are thereby converted into a submission probability distribution.
Based on these probability values, the system can rank the importance of the materials. A material with a high proportion tends to mean the user relies on it when transacting business, reflecting the actual intention from the side. The system generates the initial user intention ordering q by sorting the materials in descending order of submission probability: q indicates the user most often submits the identity card, then the business license, then the household register, and so on. Counting submissions alone may be misleading, however; an important signal must also be considered, namely the user's aborted transactions. The system further finds in the log that among the 10 matters the user aborted 3, meaning he may not have been interested in the originally selected services. The system therefore adjusts the initial intention ordering q according to the abort count n. If the user aborted nothing (n = 0), q is used directly as the final behavior feature sequence r. If the user has aborted (n > 0), the system moves the first 3 materials of q backward by 3 positions, producing a new sequence r. The logic of the adjustment is that, since the user showed "regret" over the original selections during handling, the materials he initially favored may not be so important and should yield to later options. After this abort-count correction, the system finally determines the user behavior feature sequence uf. The sequence uf integrates the user's submission counts, submission proportions and regret behavior, characterizing the user's real intention in government transactions more comprehensively. In this example, uf may be [business license, household register, identity card, other materials …], reflecting that the user's actual needs more likely lie in starting a business and real estate.
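A minimal sketch of this proportion-ranking and abort adjustment (plain Python; the rule of moving the top n items back n positions follows the description above, and the material names are illustrative):

```python
def behavior_sequence(counts: dict[str, int], n_aborts: int) -> list[str]:
    """counts maps material -> total submissions; returns the feature sequence r."""
    total = sum(counts.values())
    proportions = {m: c / total for m, c in counts.items()}
    q = sorted(proportions, key=proportions.get, reverse=True)  # initial order
    if n_aborts == 0:
        return q
    # Move the top n_aborts materials back by n_aborts positions.
    head, rest = q[:n_aborts], q[n_aborts:]
    return rest[:n_aborts] + head + rest[n_aborts:]

counts = {"identity card": 8, "household register": 5, "business license": 7,
          "application form": 3, "photo": 2}
print(behavior_sequence(counts, n_aborts=3))
# ['application form', 'photo', 'identity card', 'business license',
#  'household register']: the top three entries each shift back three positions.
```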
After obtaining the feature vectors of the four modalities, the system normalizes each to eliminate scale differences, then inputs the normalized features into the multi-view representation learning model. The model learns the weight coefficients of the different views by minimizing the multi-view reconstruction error under a weight-sparsity constraint. After the weights are learned, the system computes the weighted combination of the multi-view features to form the fused multi-modal representation. First comes feature normalization. Since the feature distributions of different modalities can differ greatly, direct fusion would hurt accuracy; the system therefore normalizes each feature first, mapping it to the same scale. For the text feature tf and the audio feature af, the system uses L2-norm normalization, dividing by the vector's L2 norm to obtain a unit vector: the new text feature is tf' = tf/||tf||_2 and, in the same way, the audio feature is af' = af/||af||_2. The image feature imf is normalized with mean and standard deviation, subtracting the mean and dividing by the standard deviation: imf' = (imf - mean_imf)/std_imf, where the mean and standard deviation are computed over all elements of imf. For the discrete behavior feature uf, the system uses max-min normalization, linearly mapping each dimension into the [0, 1] interval: uf' = (uf - min_uf)/(max_uf - min_uf). Through these transformations, the originally heterogeneous features tf, imf, af and uf are unified into a comparable numeric range.
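A sketch of the three normalization schemes (numpy; the small eps guard against division by zero is an added safeguard, not in the original):

```python
import numpy as np

EPS = 1e-12  # guard against division by zero (added safeguard)

def l2_normalize(x):                 # text tf and audio af
    return x / (np.linalg.norm(x) + EPS)

def zscore_normalize(x):             # image imf: subtract mean, divide by std
    return (x - x.mean()) / (x.std() + EPS)

def minmax_normalize(x):             # behavior uf: map into [0, 1]
    return (x - x.min()) / (x.max() - x.min() + EPS)

rng = np.random.default_rng(0)
tf_n, uf_n = l2_normalize(rng.random(200)), minmax_normalize(rng.random(50))
imf_n, af_n = zscore_normalize(rng.random(500)), l2_normalize(rng.random(100))
```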
After normalization, the system concatenates the four features into the multi-view matrix MF = [tf', uf', imf', af'], each column block of which corresponds to one class of feature view. To learn the optimal feature combination from MF, the system constructs the weighted reconstruction objective function L(W) = ||MF - MF×W||_F^2 + λ||W||_1, where W is the weight matrix to be learned, ||X||_F denotes the Frobenius norm of a matrix, ||X||_1 the L1 norm, and λ the regularization parameter. The significance of this objective is that the original matrix MF is reconstructed through the weighted view combination MF×W while the weight matrix W itself is kept as sparse as possible (expressed by minimizing its L1 norm) alongside the reconstruction error. This automatically selects the important subset of views during feature fusion. To optimize this objective, the system uses the efficient proximal algorithm PGD (proximal gradient descent). The idea of PGD is to construct, at each iteration, a first-order Taylor expansion of the objective using the gradient information of the current weight matrix, and then update the weights along this simplified gradient direction; after many iterations, the weight matrix converges to the optimum W. PGD balances accuracy and speed and can efficiently handle large-scale multi-view data.
After learning the optimal weights W, the system reads the weight of each view from the column vectors of the matrix: wtf, wuf, wimf and waf. The features tf', uf', imf' and af' are then weighted by their respective weight vectors, and the weighted view vectors are concatenated into a vector representation vfused that characterizes the user in full. For example, if the original feature dimensions are tf (200 dimensions), uf (50 dimensions), imf (500 dimensions) and af (100 dimensions), with learned weights wtf (200 dimensions), wuf (50 dimensions), wimf (500 dimensions) and waf (100 dimensions), then the final vfused is a feature vector of 200+50+500+100 = 850 dimensions. It integrates the user descriptions from text semantics, business intention, image content, voice information and other angles, and is more complete and accurate than any single view. Finally, the system feeds the fused feature vector vfused into a pre-trained deep neural network model. The model is an end-to-end multi-layer perceptron, trained on a large volume of user profiles and actually transacted government matters, which can learn the user's latent interests and service needs from the fused features. The model outputs a probability distribution over government matters, predicting how likely the current user is to need each matter. The system selects the TopK matters with the highest probability, generates a recommendation list, and presents it to the user for reference. In this example, by fusing the multi-modal data the system infers that the user likely has real needs around starting a business and buying a house, so when recommending government matters it prioritizes service items on related topics such as "company registration", "official seal engraving" and "house purchase record". Such personalized recommendation matches the user's real intention and improves the accuracy and convenience of government services.
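A sketch of the final scoring and Top-K selection (a two-layer numpy MLP with random stand-in weights; a trained model would supply the parameters, and the matter names are illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def recommend(vfused, params, matters, k=3):
    """Score matters with a two-layer MLP, then return the Top-K list."""
    h = relu(params["W1"] @ vfused + params["b1"])
    probs = softmax(params["W2"] @ h + params["b2"])
    topk = np.argsort(probs)[::-1][:k]
    return [(matters[i], float(probs[i])) for i in topk]

rng = np.random.default_rng(0)
params = {"W1": rng.normal(size=(128, 850)) * 0.05, "b1": np.zeros(128),
          "W2": rng.normal(size=(5, 128)) * 0.05, "b2": np.zeros(5)}
matters = ["company registration", "official seal engraving",
           "house purchase record", "tax declaration", "property certificate"]
print(recommend(rng.normal(size=850), params, matters, k=3))
```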

Claims (10)

1. A government affair item dialogue recommendation method based on multidimensional vector fusion comprises the following steps:
Collecting multi-modal data including user text, user behavior logs, images and audio signals;
Extracting text characteristics from the text of the user by using a natural language processing method; extracting user behavior characteristics from the user behavior log by using a statistical method; extracting image features from the image by using a computer vision method; extracting audio features from the audio signal by using a digital signal processing method;
respectively carrying out normalization processing on the extracted text features, the user behavior features, the image features and the audio features;
Adopting a multi-view representation learning algorithm, taking the normalized text characteristics, user behavior characteristics, image characteristics and audio characteristics as different views, and acquiring weight coefficients of the different views through an objective function;
Multiplying the feature vectors of different views with the corresponding weight coefficients to obtain a fused multi-mode feature representation;
And constructing a neural network model, taking the fused multi-mode characteristic representation as input, and outputting a government affair item recommendation result.
2. The government affair item dialogue recommendation method based on multidimensional vector fusion according to claim 1, wherein the method comprises the following steps:
Extracting text features from user text using natural language processing methods, comprising:
Performing word segmentation on the input user text to obtain a word sequence {Wi};
Performing part-of-speech tagging on the segmented word sequence {Wi} to obtain a part-of-speech sequence {ti};
constructing a named entity recognition model based on a conditional random field CRF;
Taking the word sequence {Wi} and the corresponding part-of-speech sequence {ti} as observation sequences and inputting them into the constructed CRF named entity recognition model; taking the words and phrases in the government-domain dictionary as input features to construct a feature template;
training a CRF named entity recognition model by using the marked government affair text data set;
Inputting the observation sequence into a trained CRF named entity recognition model, and carrying out sequence labeling by adopting a Viterbi algorithm;
And acquiring the words labeled as department entities and matter entities from the labeling result to obtain the named entity features as the first class of text features.
3. The government affair item dialogue recommendation method based on multidimensional vector fusion according to claim 2, wherein the method comprises the following steps:
extracting text features from user text by using a natural language processing method, and further comprising:
Acquiring text data in the government field and performing word segmentation to obtain a word sequence {Ci};
Taking the government-domain dictionary as the candidate feature vocabulary, counting the frequency of occurrence of each candidate word in the word sequence {Ci}, and calculating the term frequency TF;
Counting the number of documents containing each vocabulary, and calculating an inverse document frequency IDF;
Multiplying TF and IDF to obtain TF-IDF value of each vocabulary as vocabulary weight;
Constructing a TF-IDF model according to the vocabulary weight, and training the constructed TF-IDF model;
Calculating the weight value of each word in the word sequence {Wi} of the input user text with the trained TF-IDF model;
Setting a weight threshold value, and selecting words with weight values larger than the threshold value to obtain keyword features as second-class text features;
And splicing the first type text features and the second type text features to obtain a final text feature vector tf.
4. The government affair item dialogue recommendation method based on multidimensional vector fusion as claimed in claim 3, wherein the method comprises the following steps:
Extracting user behavior features from the user behavior log by using a statistical method, including:
Analyzing the input user behavior log to obtain the quantities of the various materials the user submitted in each government transaction, forming a quantity sequence {mi} in which each element corresponds to the submitted quantity of one material; acquiring the number n of aborted transactions during the user's handling of government matters;
Counting the total number sum of all materials submitted by the user;
For each element mi in the quantity sequence {mi}, calculating the submission proportion pi = mi/sum of the corresponding material;
Sorting the material submission proportions pi in descending order to obtain the initial intention ordering sequence q of the user's government transactions;
Adjusting the initial intention ordering sequence q according to the number n of aborted transactions to obtain the final user behavior feature sequence r:
when n = 0, r = q;
when n > 0, shifting the top K elements of q backwards by K positions to form the adjusted sequence r;
the user behavior feature sequence r is taken as the user behavior feature uf extracted from the user behavior log.
5. The government affair item dialogue recommendation method based on multidimensional vector fusion according to claim 4, wherein the method comprises the following steps:
extracting image features from an image using a computer vision method, comprising:
Denoising the input image I by adopting a Gaussian filtering method to obtain a denoised image I';
Carrying out direction correction on the image I' by adopting the Hough transform method to obtain a preprocessed image I'';
Taking the preprocessed image I'' as input to a pre-trained optical character recognition (OCR) model;
Locating and recognizing the text regions in the image I'' with the OCR model to obtain the text content texts;
word segmentation is carried out on the text content texts by adopting a word segmentation algorithm, so as to obtain word sequences words;
Acquiring the position information of each word in the text content texts to obtain a word position sequence positions;
Converting word sequences words into word vector sequences wv;
Converting the word position sequence positions into a position vector sequence pv;
Splicing the word vector sequence wv and the position vector sequence pv to obtain image feature vectors imf, imf = [ wv, pv ];
the image feature vector imf is taken as an image feature extracted from the input image I.
6. The government affair item dialogue recommendation method based on multidimensional vector fusion according to claim 5, wherein the method comprises the following steps:
Extracting audio features from an audio signal using a digital signal processing method, comprising:
Carrying out noise elimination treatment on the input audio signal a by adopting spectral subtraction to obtain a noise-eliminated audio signal a';
Performing end point detection on the audio signal a ', and removing a silent section to obtain a preprocessed audio signal a';
Performing equal-interval frame segmentation on the audio signal a'' with a Hamming window to obtain the audio signal frame sequence F = {f[1], f[2], …, f[n]};
Extracting features from each frame f[i] with the mel-frequency cepstral coefficient (MFCC) algorithm to obtain the audio feature vector v[i] of the corresponding frame;
Combining the audio feature vectors v[i] of all frames into an audio feature matrix V;
the Euclidean distance between vectors in the audio feature matrix V is calculated, and a distance matrix D is generated:
Wherein D [ i, j ] represents the euclidean distance between the audio feature vectors of the i-th frame and the j-th frame;
And reducing the dimension of the distance matrix D by adopting a multidimensional scaling algorithm MDS to obtain a dimension-reduced audio signal symbol graph matrix G, wherein each row of G corresponds to the coordinate of an audio signal in a dimension-reduced space.
7. The government affair item dialogue recommendation method based on multidimensional vector fusion as claimed in claim 6, wherein the method comprises the following steps:
extracting audio features from an audio signal using a digital signal processing method, further comprising:
Taking each row of the matrix G as a point in the low-dimensional space, each point corresponding to one audio signal frame f[i];
Constructing a topological graph T whose nodes are the audio signal frames f[i];
Calculating the Euclidean distance d[i, j] between any two points in the low-dimensional space as the weight of the edge between the corresponding audio signal frame nodes in the topological graph T;
Searching the connected subgraphs of the topological graph T with a depth-first search algorithm to obtain the connected subgraph set C = {c[1], c[2], …};
Extracting a feature vector s[k] from each connected subgraph c[k] to obtain the voiceprint feature vector set S = {s[1], s[2], …};
Clustering the voiceprint feature vector set S with a K-means clustering algorithm to obtain the clustering result K = {k[1], k[2], …};
Calculating a center vector c [ i ] of each category k [ i ] as a candidate voiceprint feature vector;
Selecting a vector with the largest inter-class distance from the candidate voiceprint feature vectors as a final voiceprint feature vector v;
The voiceprint feature vector v is taken as the audio feature af extracted from the input audio signal a.
8. The multi-dimensional vector fusion-based government affair item dialogue recommendation method according to any one of claims 1 to 7, wherein:
outputting a government affair item recommendation result, including:
Obtaining the text feature vector tf, user behavior feature vector uf, image feature vector imf and audio feature vector af;
Calculating the L2 norm ||tf||_2 of the text feature vector tf;
Obtaining the normalized text feature vector tf' = tf/||tf||_2 according to the norm ||tf||_2;
Calculating the minimum value min_uf and maximum value max_uf of the elements in the user behavior feature vector uf;
Obtaining the normalized user behavior feature vector uf' = (uf - min_uf)/(max_uf - min_uf) according to the minimum value min_uf and maximum value max_uf;
Calculating the mean mean_imf and standard deviation std_imf of the elements in the image feature vector imf;
Obtaining the normalized image feature vector imf' = (imf - mean_imf)/std_imf according to the mean mean_imf and standard deviation std_imf;
Calculating the L2 norm ||af||_2 of the audio feature vector af;
Obtaining the normalized audio feature vector af' = af/||af||_2 according to the norm ||af||_2.
9. The government affair item dialogue recommendation method based on multidimensional vector fusion as claimed in claim 8, wherein the method comprises the following steps:
outputting the government affair item recommendation result, and further comprising:
Constructing a multi-view feature matrix MF according to the normalized text feature vector tf ', the user behavior feature vector uf', the image feature vector imf 'and the audio feature vector af', wherein MF= [ tf ', uf', imf ', af' ];
Constructing an objective function L(W) = ||MF - MF×W||_F^2 + λ||W||_1, wherein W is the weight matrix and λ a regularization parameter; ||·||_F denotes the Frobenius norm and ||·||_1 the L1 norm;
Solving the objective function L(W) with an alternating optimization algorithm to obtain the optimal weight matrix W;
Obtaining weight vectors wtf, wuf, wimf and waf of different feature views from the optimal weight matrix W;
Respectively carrying out weighted fusion on the text feature vector tf ', the user behavior feature vector uf', the image feature vector imf 'and the audio feature vector af' and the corresponding weight vectors;
and splicing the weighted and fused feature vectors to obtain a final fused feature vector vfused.
10. The government affair item dialogue recommendation method based on multidimensional vector fusion as claimed in claim 9, wherein the method comprises the following steps:
the alternating optimization algorithm adopts the proximal gradient descent algorithm PGD, performing iterative optimization by constructing a first-order Taylor expansion of the weight matrix W and applying gradient descent.