CN117077071A - Data analysis method and system based on data classification - Google Patents

Data analysis method and system based on data classification

Info

Publication number
CN117077071A
Authority
CN
China
Prior art keywords
data
feature
text
image
event data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310830398.0A
Other languages
Chinese (zh)
Inventor
赵刘琦
高兴宇
成静文
文星
陈欢
王朝硕
黄振林
王宁
朱金惟
申晓杰
冯子焰
张宇恒
邱天乙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Microelectronics of CAS
Super High Transmission Co of China South Electric Net Co Ltd
Original Assignee
Institute of Microelectronics of CAS
Super High Transmission Co of China South Electric Net Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Microelectronics of CAS, Super High Transmission Co of China South Electric Net Co Ltd filed Critical Institute of Microelectronics of CAS
Priority to CN202310830398.0A priority Critical patent/CN117077071A/en
Publication of CN117077071A publication Critical patent/CN117077071A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/254 Fusion techniques of classification results, e.g. of results related to same input data
    • G06F18/256 Fusion techniques of classification results, e.g. of results related to same input data of results relating to different input data, e.g. multimodal recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G06N5/025 Extracting rules from data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a data analysis method and system based on data classification, belongs to the technical field of data mining and analysis, and solves the problems of incomplete semantic relations and low availability of associated data in existing data analysis. The method comprises: acquiring event data of a power system, each piece comprising an image and its text content; extracting each segmented word and a text feature code from the text content with a text feature extractor; extracting each target and an image feature code from the image with an image feature extractor; based on knowledge graph embedding, obtaining the embedding vectors of the segmented words and targets that exist in the knowledge graph; constructing relation feature vectors according to the similarity between the embedding vectors; obtaining a multimodal feature vector from the relation feature vectors, the text feature code and the image feature code, and passing the multimodal feature vector to a hierarchical recognition module to obtain the level of each piece of event data; and analyzing the association relations between event data of the same level according to a CLIP model and a clustering algorithm. Accurate data analysis is thereby achieved.

Description

Data analysis method and system based on data classification
Technical Field
The application relates to the technical field of data mining and analysis, in particular to a data analysis method and system based on data classification.
Background
The digital transformation of enterprises cannot be separated from high-quality data management and analysis, and new technologies such as machine learning and artificial intelligence are increasingly applied to data management and analysis. These technologies help enterprises better understand and apply data and improve data quality and management efficiency. Machine learning, data mining and similar techniques can rapidly identify and extract important information and patterns in data, so that complex data management tasks can be handled better. Automated data management and analysis tools help enterprises quickly identify and use the key information in their data in order to make better business decisions and discover potential opportunities.
Currently, most data asset inventory work in the industry is done manually and recorded with Excel. However, a converter station uses a large number of devices during operation and an even larger number of control systems for its management, which makes the data volume of a power company extremely large. In addition, the operation of a converter station generates various structured and unstructured data, together with huge amounts of associated data such as video and audio data, energy data and weather data. These factors distinguish power data from other data: the data types are diverse and the volume is huge. Checking the data manually is time-consuming and labor-intensive, and when associated data of different structures, such as abnormal alarms collected by monitoring video equipment and historical faults recorded as text, are analyzed manually, misjudgments and omissions are likely.
In addition, the prior art does not introduce external information, so the features among data cannot be extracted accurately and the semantic relations are incomplete. It also ignores the importance of the data and analyzes data of different levels together, which slows data analysis and querying and establishes many weak association relations; as a result, much information of little use is associated with the data, which hampers users.
Disclosure of Invention
In view of the above analysis, the embodiments of the application aim to provide a data analysis method and system based on data classification, which solve the problems of incomplete semantic relations and low availability of associated data in existing data analysis.
In one aspect, an embodiment of the application provides a data analysis method based on data classification, comprising the following steps:
acquiring event data of a power system, wherein each piece of event data comprises an image and its text content;
extracting each segmented word and a text feature code from the text content with a text feature extractor; extracting each target and an image feature code from the image with an image feature extractor;
based on knowledge graph embedding, putting the segmented words and the targets that exist in the knowledge graph into a text node set and an image node set respectively, and obtaining the embedding vector of each node; constructing a relation feature vector for each node according to the similarity between the embedding vectors of the nodes;
obtaining a multimodal feature vector of each piece of event data according to the relation feature vectors, the text feature code and the image feature code, and passing the multimodal feature vector to a hierarchical recognition module to obtain the level of each piece of event data; and analyzing the association relations between event data of the same level according to a CLIP model and a clustering algorithm.
In a further improvement of the method, the relation feature vector is a four-dimensional vector formed by the relation values of four groups; the similarities are divided into four groups according to their sign and the set to which the compared node belongs, and the relation value of each group is the sum of the similarities in that group.
In a further improvement of the method, the method further comprises: setting the relation feature vector of each segmented word and target that is not in the knowledge graph to a four-dimensional vector of zeros.
In a further improvement of the method, obtaining the multimodal feature vector of each piece of event data according to the relation feature vectors, the text feature code and the image feature code comprises:
splicing the relation feature vector of each node in the text node set to the head of the text feature code; splicing the relation feature vector of each node in the image node set to the head of the image feature code; passing the two spliced feature vectors through two linear layers respectively, which output feature representations of the same dimension; and combining the two feature representations of the same event data to obtain the multimodal feature vector.
In a further improvement of the method, after receiving the multimodal feature vector of each piece of event data, the hierarchical recognition module obtains the level of the event data through a linear layer projection followed by a softmax layer; the levels are core, important and general.
In a further improvement of the method, analyzing the association relations between event data of the same level according to the CLIP model and the clustering algorithm comprises:
inputting the images and text contents of event data of the same level into a pre-trained CLIP model to obtain the fusion feature vector that is input to the softmax layer;
and grouping the fusion feature vectors with a clustering algorithm, and establishing a strong association relation between the event data that fall into the same class according to the clustering result.
In a further improvement of the method, the text feature code and the image feature code have the same length and are obtained by passing the output of the last hidden layer of the text feature extractor and of the image feature extractor, respectively, through two linear layers, regularization and an activation function.
In a further improvement of the method, the knowledge graph embedding uses a TransE model to obtain the embedding vector of each entity in the Freebase knowledge graph dataset.
In a further improvement of the method, the text feature extractor is a pre-trained BERT model, and the image feature extractor is a pre-trained Vision Transformer model.
In another aspect, an embodiment of the application provides a data analysis system based on data classification, comprising:
an event data acquisition module, configured to acquire event data of a power system, wherein each piece of event data comprises an image and its text content;
an event feature extraction module, configured to extract each segmented word and a text feature code from the text content with a text feature extractor, and to extract each target and an image feature code from the image with an image feature extractor;
a joint relation extraction module, configured to put, based on knowledge graph embedding, the segmented words and the targets that exist in the knowledge graph into a text node set and an image node set respectively, to obtain the embedding vector of each node, and to construct a relation feature vector for each node according to the similarity between the embedding vectors of the nodes;
a data hierarchical analysis module, configured to obtain a multimodal feature vector of each piece of event data according to the relation feature vectors, the text feature code and the image feature code, to pass the multimodal feature vector to a hierarchical recognition module to obtain the level of each piece of event data, and to analyze the association relations between event data of the same level according to a CLIP model and a clustering algorithm.
Compared with the prior art, the application has at least one of the following beneficial effects: external information is reasonably introduced, and the semantic relations between images and texts in the event data are supplemented according to the similarity between the embedding vectors of that external information, so that the features of the event data are considered comprehensively and the accuracy of event data classification for the power system is improved; according to the level of the event data, association analysis is performed selectively on data of a designated level or of the same level, which reduces the data processing scale and the cost of data analysis to system performance; and cluster analysis of the fusion feature vectors establishes strong association relations between data of the same level, which improves the usability of the data.
In the application, the above technical solutions can be combined with one another to realize further preferred combinations. Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and drawings.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the application, like reference numerals being used to refer to like parts throughout the several views.
Fig. 1 is a flow chart of a data analysis method based on data classification in embodiment 1 of the present application.
Detailed Description
The following detailed description of preferred embodiments of the application is made in connection with the accompanying drawings, which form a part hereof, and together with the description of the embodiments of the application, are used to explain the principles of the application and are not intended to limit the scope of the application.
Example 1
In one embodiment of the present application, a data analysis method based on data classification is disclosed, as shown in fig. 1, comprising the following steps:
S11, acquiring event data of the power system, wherein each piece of event data comprises an image and its text content.
It should be noted that monitoring cameras are installed at each key workstation of the converter station of the electric power company, for example the operator workstation, the oil chromatography workstation and the electric energy metering workstation. A video monitoring system collects image data of each key event, the images are preprocessed to a uniform size, and each image corresponds to at least one text content.
Illustratively, the key event is the overhaul of a gas chromatograph at the oil chromatography workstation. The collected image data is an image of the overhaul scene, and there are 3 corresponding text contents: in June 2018, the first gas chromatograph of the oil chromatography workstation showed abnormal readings; on June 3, external personnel registered and entered the oil chromatography workstation; a worker overhauled the gas chromatograph.
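As a minimal sketch of the event data described in step S11, one image paired with one or more text contents could be modelled as follows. The class name `EventData`, the field names and the file name are hypothetical and chosen only for illustration; the source does not prescribe a data structure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EventData:
    """One piece of power-system event data: an image plus its text contents."""

    image_path: str  # hypothetical: path to the size-normalized image
    texts: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # The description requires at least one text content per image.
        if not self.texts:
            raise ValueError("each image must correspond to at least one text content")


# The gas chromatograph overhaul example from the description.
event = EventData(
    image_path="overhaul_scene.jpg",  # hypothetical file name
    texts=[
        "In June 2018, the first gas chromatograph showed abnormal readings.",
        "On June 3, external personnel registered and entered the workstation.",
        "A worker overhauled the gas chromatograph.",
    ],
)
```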
S12, extracting each segmented word and a text feature code from the text content with the text feature extractor; and extracting each target and an image feature code from the image with the image feature extractor.
It should be noted that the text feature extractor is a pre-trained BERT model whose input vector is composed of token embedding, segment embedding and position embedding. For token embedding, WordPiece is used to tokenize the segmented words $\{w_0, w_1, \ldots, w_t\}$ in each piece of text content; a classification token [CLS] is added at the front of the sentence sequence to obtain the final text feature representation for subsequent classification, and a [SEP] token is added between sentences to separate them. Segment embedding indicates the sentence to which each segmented word belongs; position embedding labels each word with its specific position in the sequence.
The BERT model uses a multi-layer Transformer as its backbone to learn rich contextual information. In this step, the output of the last hidden layer of the text feature extractor is passed sequentially through two linear layers, regularization and an activation function to obtain the encoding used as the text feature code $E_{text}$, expressed as follows:
$E_{text} = \mathrm{Tr}(f_1(\{w_0, w_1, \ldots, w_t\})) = (x_1, x_2, \ldots, x_n)$    Formula (1)
where $f_1$ represents the text preprocessing operation, Tr represents the multi-layer Transformer processing, and $n$ represents the length of the text feature code.
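The input construction described above ([CLS] at the front, [SEP] separators, segment and position indices) can be sketched as follows. This is an illustration only: the function name `build_bert_inputs` is hypothetical, the sentences are assumed to be already split into WordPiece tokens, and a [SEP] is appended after every sentence, as standard BERT inputs do. Standard BERT uses only two segment ids (0 and 1), so the sketch is meant for one or two sentences.

```python
from typing import Dict, List


def build_bert_inputs(sentences: List[List[str]]) -> Dict[str, list]:
    """Build token, segment and position sequences for pre-tokenized sentences."""
    tokens = ["[CLS]"]   # classification token at the front of the sequence
    segment_ids = [0]    # [CLS] belongs to the first sentence
    for sentence_index, sentence in enumerate(sentences):
        for token in sentence + ["[SEP]"]:
            tokens.append(token)
            segment_ids.append(sentence_index)
    # Position ids simply number every token in the sequence.
    position_ids = list(range(len(tokens)))
    return {"tokens": tokens, "segment_ids": segment_ids, "position_ids": position_ids}


inputs = build_bert_inputs([["gas", "chromat", "##ograph"], ["abnormal", "reading"]])
```

In a real pipeline the three id sequences are looked up in the model's embedding tables and summed to form the input vector.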
It should be noted that the image feature extractor is a pre-trained Vision Transformer (ViT) model, which segments the input image into equal image blocks $\{v_0, v_1, \ldots, v_M\}$; the sequence of image blocks is mapped into a low-dimensional space by a linear projection layer and thereby converted into a sequence of vectors, which is then input into a Transformer encoder for processing. Illustratively, the size of the input image is 224 × 224, and it is divided into 16 × 16 image blocks.
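The block segmentation can be illustrated with a small sketch. It assumes that "16 × 16" refers to the block size in pixels, as in the common ViT configuration, so a 224 × 224 image yields 14 × 14 = 196 blocks; the function name `split_into_patches` is hypothetical, and the image is a single-channel nested list to keep the example dependency-free.

```python
from typing import List

Image = List[List[float]]


def split_into_patches(image: Image, patch_size: int) -> List[List[float]]:
    """Split a 2-D image into non-overlapping square blocks, each flattened row by row."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ])
    return patches


blank = [[0.0] * 224 for _ in range(224)]
patches = split_into_patches(blank, 16)  # 196 blocks of 256 values each
```

Each flattened block would then be multiplied by the linear projection layer to obtain one vector of the input sequence.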
Similarly, the output of the last hidden layer of the image feature extractor is passed sequentially through two linear layers, regularization and an activation function to obtain the encoded feature vector used as the image feature code $E_{image}$:
$E_{image} = \mathrm{Tr}(f_2(\{v_0, v_1, \ldots, v_M\})) = (y_1, y_2, \ldots, y_n)$    Formula (2)
where $f_2$ represents the preprocessing operation on the image blocks, Tr represents the multi-layer Transformer processing, and the length $n$ of the image feature code is the same as the length of the text feature code.
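The post-processing that maps the last hidden layer to a feature code of length $n$ can be sketched as below. The source only says "two linear layers, regularization and an activation function"; layer normalization and ReLU are assumptions made here for illustration, bias terms are omitted, and the weights are random placeholders rather than trained parameters. All function names are hypothetical.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]


def make_linear(in_dim: int, out_dim: int, rng: random.Random) -> Matrix:
    """Random placeholder weights of shape (out_dim, in_dim); not trained values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def linear(weights: Matrix, x: Vector) -> Vector:
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def feature_code(hidden: Vector, w1: Matrix, w2: Matrix) -> Vector:
    """Two linear layers, then normalization (assumed), then activation (assumed)."""
    return relu(layer_norm(linear(w2, linear(w1, hidden))))


rng = random.Random(0)
n = 8  # shared feature code length
text_hidden = [rng.uniform(-1, 1) for _ in range(12)]   # stand-in for the BERT output
image_hidden = [rng.uniform(-1, 1) for _ in range(16)]  # stand-in for the ViT output
e_text = feature_code(text_hidden, make_linear(12, 10, rng), make_linear(10, n, rng))
e_image = feature_code(image_hidden, make_linear(16, 10, rng), make_linear(10, n, rng))
```

Using separate weights for the two extractors but the same output size is what makes the text and image feature codes equal in length even when the hidden sizes differ.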
Further, each target in the image is identified according to the output of the image feature extractor. Illustratively, in the image of the gas chromatograph overhaul at the oil chromatography workstation, the targets include the gas chromatograph and a person.
S13, based on knowledge graph embedding, putting the segmented words and the targets that exist in the knowledge graph into a text node set and an image node set respectively, and obtaining the embedding vector of each node; and constructing the relation feature vector of each node according to the similarity between the embedding vectors of the nodes.
It should be noted that a knowledge graph (KG) is a technique that uses a graph model to describe knowledge and to model the associations between things. Knowledge graph embedding (KGE) is a widely adopted knowledge representation method whose main idea is to embed the entities and relations of a knowledge graph into a continuous vector space so that their similarity can be quantified by a scoring function. Freebase is a vast, multi-domain knowledge graph dataset that contains more than 250 million entities and collects a large amount of entity attribute and relation attribute information.
In this step, the knowledge graph embedding uses a TransE model to obtain the embedding vector of each entity in the Freebase knowledge graph dataset. The TransE model, also known as a translation model, is a method for representing the entities and relations of a large-scale multi-relational dataset as embeddings.
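For readers unfamiliar with TransE, its scoring idea can be sketched in a few lines: a triple (head, relation, tail) is considered plausible when head + relation lies close to tail in the embedding space. The toy two-dimensional vectors and entity names below are hypothetical and are not taken from Freebase.

```python
import math
from typing import Sequence


def transe_score(head: Sequence[float], relation: Sequence[float], tail: Sequence[float]) -> float:
    """Negative L2 distance between (head + relation) and tail; higher is more plausible."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


# Hypothetical toy embeddings.
gas_chromatograph = [1.0, 0.0]
located_in = [0.0, 1.0]
oil_workstation = [1.0, 1.0]
metering_workstation = [-1.0, 0.5]

plausible = transe_score(gas_chromatograph, located_in, oil_workstation)
corrupted = transe_score(gas_chromatograph, located_in, metering_workstation)
```

Training pushes true triples toward a higher score than corrupted ones; in this method only the resulting entity embedding vectors are used afterwards.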
It should be noted that the segmented words of each text content and the names of the targets in the image, obtained with the text feature extractor and the image feature extractor in step S12, are used as entity names to check whether they exist in the knowledge graph; the segmented words and targets that exist in the knowledge graph are put into the text node set and the image node set respectively, and the embedding vector of each node is obtained. That is, the segmented words and targets in the node sets can be aligned and matched with the knowledge graph.
For a node $v_i$, the similarities $\{d_{i1}, d_{i2}, \ldots, d_{ik}\}$ between it and the other nodes $\{v_1, v_2, \ldots, v_k\}$ are obtained according to cosine similarity.
The similarity lies in the range $[-1, 1]$, and positive and negative values have different meanings; the set to which a node belongs is either the text node set $V_T$ or the image node set $V_I$. Therefore, $\{d_{i1}, d_{i2}, \ldots, d_{ik}\}$ is divided into four groups according to the sign of the similarity and the set to which the compared node belongs, the sum of the similarities in each group gives one of four relation values, and the four values form a four-dimensional vector that serves as the relation feature vector of node $v_i$, denoted $R_i = (r_{i0}, r_{i1}, r_{i2}, r_{i3})$. The specific calculation formula is as follows:
It should be noted that the relation feature vector of a segmented word or target that does not exist in the knowledge graph is set to the all-zero four-dimensional vector $R_z = (0, 0, 0, 0)$.
Through the above steps, each of the m segmented words in each text content and each of the g targets in each image corresponds to one relation feature vector.
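The grouping described above can be sketched as follows. The order of the four components is an assumption, since the calculation formula is not reproduced in this text: here $r_{i0}$ and $r_{i1}$ are the sums of positive and negative similarities to text nodes, and $r_{i2}$ and $r_{i3}$ the corresponding sums for image nodes. Function and node names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def relation_feature_vector(
    node: str,
    text_nodes: List[str],
    image_nodes: List[str],
    embeddings: Dict[str, Sequence[float]],
) -> Tuple[float, float, float, float]:
    """Sum similarities in four groups: (text+, text-, image+, image-); order assumed."""
    if node not in embeddings:
        return (0.0, 0.0, 0.0, 0.0)  # R_z for words/targets absent from the graph
    relation = [0.0, 0.0, 0.0, 0.0]
    for offset, nodes in ((0, text_nodes), (2, image_nodes)):
        for other in nodes:
            if other == node:
                continue
            d = cosine_similarity(embeddings[node], embeddings[other])
            if d > 0:
                relation[offset] += d
            elif d < 0:
                relation[offset + 1] += d
    return (relation[0], relation[1], relation[2], relation[3])


# Hypothetical toy embeddings for two text nodes and two image nodes.
embeddings = {"a": [1.0, 0.0], "b": [-1.0, 0.0], "c": [1.0, 0.0], "d": [0.0, 1.0]}
r_a = relation_feature_vector("a", ["a", "b"], ["c", "d"], embeddings)
```

A similarity of exactly zero contributes to no group in this sketch, which is a design choice the source does not address.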
Compared with the prior art, this embodiment introduces external information by means of a vast, multi-domain knowledge graph and supplements the semantic relations between images and texts in the event data according to the similarity between the embedding vectors of that external information, so that the features of the event data are considered comprehensively and the accuracy of data analysis is improved.
S14, obtaining the multimodal feature vector of each piece of event data according to the relation feature vectors, the text feature code and the image feature code, and passing the multimodal feature vector to a hierarchical recognition module to obtain the level of each piece of event data; and analyzing the association relations between event data of the same level according to the CLIP model and the clustering algorithm.
It should be noted that, in this embodiment, the hierarchical classification standard for event data is formulated in advance with reference to the business standards of the national power industry and of the professional field, and to how each part of the data is used in the power grid. Historical event data is labeled by experts according to this standard with one of three levels: core, important and general. The hierarchical recognition module comprises a linear layer and a softmax classification layer and is trained with a cross-entropy loss function to obtain the trained hierarchical recognition module.
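As a brief illustration of the training objective, the cross-entropy loss for one expert-labeled sample is the negative log of the softmax probability assigned to the labeled level. The logits below stand in for the output of the linear layer and are made-up values; the function names are hypothetical.

```python
import math
from typing import List

LEVELS = ("core", "important", "general")


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: List[float], label: str) -> float:
    """Loss for one sample: -log of the probability of the expert-assigned level."""
    return -math.log(softmax(logits)[LEVELS.index(label)])


loss_correct = cross_entropy([4.0, 0.0, 0.0], "core")     # confident and right: small loss
loss_wrong = cross_entropy([4.0, 0.0, 0.0], "general")    # confident and wrong: large loss
```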
In an actual task, after newly added or to-be-identified event data is acquired according to step S11 and its text feature code, image feature code and relation feature vectors are obtained according to steps S12 and S13, the multimodal feature vector of each piece of event data is obtained through the following steps:
splicing the relation feature vector of each node in the text node set to the head of the text feature code; splicing the relation feature vector of each node in the image node set to the head of the image feature code; passing the two spliced feature vectors through two linear layers respectively, which output feature representations of the same dimension; and combining the two feature representations of the same event data to obtain the multimodal feature vector.
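The splicing and fusion steps above can be sketched as follows. Several details are assumptions, because the source does not state them: the number of relation feature vectors is padded with zero vectors or truncated to a fixed `max_nodes` so that the linear layer sees a fixed input size, "combining" is implemented as concatenation, and the weights are random placeholders. All names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Vector = List[float]
Layer = Tuple[List[Vector], Vector]  # (weights of shape out_dim x in_dim, bias)


def splice_to_head(relations: Sequence[Sequence[float]], code: Vector, max_nodes: int) -> Vector:
    """Place the four-dimensional relation vectors in front of the feature code."""
    padded = [list(r) for r in relations[:max_nodes]]
    padded += [[0.0] * 4 for _ in range(max_nodes - len(padded))]  # assumed zero padding
    return [v for r in padded for v in r] + list(code)


def make_layer(in_dim: int, out_dim: int, rng: random.Random) -> Layer:
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def linear(layer: Layer, x: Vector) -> Vector:
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def multimodal_vector(text_in: Vector, image_in: Vector, text_layer: Layer, image_layer: Layer) -> Vector:
    """Project both spliced vectors to the same dimension, then concatenate (assumed)."""
    return linear(text_layer, text_in) + linear(image_layer, image_in)


rng = random.Random(0)
e_text, e_image = [0.2, 0.7, 0.1], [0.5, 0.3, 0.9]  # toy feature codes, n = 3
text_in = splice_to_head([(0.5, -0.2, 0.3, 0.0)], e_text, max_nodes=2)
image_in = splice_to_head([(0.1, 0.0, 0.4, -0.3), (0.0, 0.0, 0.0, 0.0)], e_image, max_nodes=2)
fused = multimodal_vector(text_in, image_in, make_layer(11, 4, rng), make_layer(11, 4, rng))
```

With `max_nodes=2` and a code length of 3, each spliced vector has 4 × 2 + 3 = 11 entries, and the fused vector has 4 + 4 = 8.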
The multimodal feature vector of the event data is passed to the hierarchical recognition module, and the level of the event data is obtained through a linear layer projection followed by a softmax layer.
The hierarchical recognition module outputs, for each piece of event data, a confidence for every level, expressed as a confidence index in the range [0, 1], and the level with the largest confidence is taken as the recognition result. If the maximum confidence index is less than 0.9, a secondary review is performed manually, thereby achieving accurate data classification and grading.
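The decision rule just described (softmax confidences, the largest one wins, manual review below 0.9) can be sketched as follows. The logits stand in for the output of the linear layer and are made-up values; the function name `recognize_level` is hypothetical.

```python
import math
from typing import List, Tuple

LEVELS = ("core", "important", "general")
REVIEW_THRESHOLD = 0.9  # confidence below this triggers a manual secondary review


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def recognize_level(logits: List[float]) -> Tuple[str, float, bool]:
    """Return (level, confidence, needs_manual_review) for one piece of event data."""
    confidences = softmax(logits)
    best = max(range(len(confidences)), key=confidences.__getitem__)
    return LEVELS[best], confidences[best], confidences[best] < REVIEW_THRESHOLD


confident = recognize_level([5.0, 0.0, 0.0])   # clear-cut case, no review needed
uncertain = recognize_level([1.0, 0.8, 0.1])   # close call, flagged for review
```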
By determining the level of the event data, this embodiment performs selective association analysis on data of a designated level and avoids processing data of different levels together, which reduces the data processing scale and the cost of data analysis to system performance, and helps maintain high system performance.
Preferably, strong association analysis is performed on event data of the core or important level, which narrows the range of data to be analyzed and allows the information required by a user to be retrieved quickly; in addition, analyzing data of the same level makes it possible for users with lower authority to query only general-level event data and its associated data.
In this step, a CLIP (Contrastive Language-Image Pre-training) model, which is pre-trained on contrastive text-image pairs, is used to learn the semantic links between images and texts in event data of the same level, yielding fusion feature vectors.
For the sample set used to train the CLIP model, the image datasets are ImageNet and COCO, and the text datasets are Wikipedia and BookCorpus. The CLIP model contains two main components: a convolutional neural network (CNN) for processing images and a Transformer model for processing text. The CLIP model is trained by contrastive learning, whose purpose is to map similar image and text samples to nearby points in the embedding space and dissimilar samples to distant points. The CLIP model can use different contrastive loss functions to achieve this goal, such as the NT-Xent contrastive loss function.
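The contrastive objective can be illustrated with a small symmetric, temperature-scaled cross-entropy loss over a batch of image-text pairs, in the spirit of NT-Xent. This is a sketch, not the training code of the CLIP model: the embeddings are toy values, the temperature of 0.07 is an illustrative choice, and the function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax_at(row: Vector, index: int) -> float:
    peak = max(row)
    return row[index] - (peak + math.log(sum(math.exp(v - peak) for v in row)))


def contrastive_loss(image_embs: List[Vector], text_embs: List[Vector], temperature: float = 0.07) -> float:
    """Symmetric loss: pair k's image should match pair k's text and vice versa."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    count = len(images)
    # logits[i][j]: scaled cosine similarity between image i and text j
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in texts] for i in images]
    image_to_text = -sum(log_softmax_at(logits[k], k) for k in range(count)) / count
    text_to_image = -sum(
        log_softmax_at([logits[r][k] for r in range(count)], k) for k in range(count)
    ) / count
    return (image_to_text + text_to_image) / 2


aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Matching pairs give a loss close to zero, while swapped pairs give a large loss, which is the pressure that pulls similar samples together in the embedding space.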
In this step, analyzing the association relations between event data of the same level according to the CLIP model and the clustering algorithm comprises:
inputting the images and text contents of event data of the same level into the pre-trained CLIP model to obtain the fusion feature vector that is input to the softmax layer; and grouping the fusion feature vectors with a clustering algorithm, and establishing a strong association relation between the event data that fall into the same class according to the clustering result.
It should be noted that the softmax layer produces the final output of the CLIP model; this step does not need that output but takes the feature vector from which the output is determined, so that cluster analysis can be performed. After strong association relations are established between event data of the same level, when a user retrieves a piece of core event data, the core event data strongly associated with it is output automatically, which improves the usability of the data. In addition, when the data catalog of the power system is constructed, associated data can be stored or linked according to the analysis results, which helps enterprises quickly identify and use the key information in their data in order to make better business decisions and discover potential opportunities.
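The clustering and retrieval behaviour can be sketched as follows. The source does not name a specific clustering algorithm; k-means with the first k vectors as initial centroids is used here purely as an assumed stand-in, and the event identifiers and two-dimensional fusion vectors are made up.

```python
from collections import defaultdict
from typing import Dict, List

Vector = List[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors: List[Vector], k: int, iterations: int = 10) -> List[int]:
    """Plain k-means; the first k vectors serve as initial centroids (assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: squared_distance(v, centroids[c])) for v in vectors]
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def strong_associations(event_ids: List[str], labels: List[int]) -> Dict[str, List[str]]:
    """Map each event to the other events that fell into the same cluster."""
    clusters = defaultdict(list)
    for event_id, label in zip(event_ids, labels):
        clusters[label].append(event_id)
    return {
        event_id: [other for other in clusters[label] if other != event_id]
        for event_id, label in zip(event_ids, labels)
    }


fusion_vectors = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0]]
labels = kmeans(fusion_vectors, k=2)
associations = strong_associations(["e1", "e2", "e3", "e4"], labels)
```

Retrieving `associations["e1"]` then returns the events strongly associated with `e1`, mirroring the automatic output described above.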
Compared with the prior art, the data analysis method based on data classification provided by this embodiment introduces external information in a reasonable way: it supplements the semantic relationship between the images and text in event data according to the similarity between the embedding vectors of the external information, so that the features of the event data are considered comprehensively and the accuracy of classifying power system event data is improved. Important or core data are selected according to the level of the event data for association analysis, which reduces the scale of data processing and the cost of data analysis to system performance. Through cluster analysis of the fused feature vectors, strong association relationships are established and the usability of the data is improved.
Example 2
In another embodiment of the present application, a data analysis system based on data classification is disclosed, which implements the data analysis method based on data classification of embodiment 1. For the specific implementation of each module, refer to the corresponding description in embodiment 1. The system comprises:
an event data acquisition module, configured to acquire event data of the power system, wherein each piece of event data comprises an image and its text content;
an event feature extraction module, configured to extract each segmented word and the text feature code from the text content using a text feature extractor, and to extract each target and the image feature code from the image using an image feature extractor;
a joint relation extraction module, configured to place, based on knowledge graph embedding, the segmented words and targets that exist in the knowledge graph into a text node set and an image node set respectively, and to obtain the embedding vector of each node; and to construct a relationship feature vector for each node according to the similarities between the embedding vectors of the nodes;
a data hierarchical analysis module, configured to obtain a multi-modal feature vector for each piece of event data according to the relationship feature vectors, the text feature codes, and the image feature codes, and to pass the multi-modal feature vector into a hierarchical recognition module to obtain the level of each piece of event data; and to analyze the association relationship between event data of the core level according to the CLIP model and the clustering algorithm.
Since this embodiment and the data analysis method based on data classification share related points that can be cross-referenced, the repeated description is omitted here. The system embodiment works on the same principle as the method embodiment and therefore has the corresponding technical effects of the method embodiment.
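The relationship feature vector built by the joint relation extraction module can be sketched as below, following the four-group construction stated in claims 2 and 3: similarities are grouped by sign and by the node set of the compared node, and each group is summed. Cosine similarity, the order of the four components, the exclusion of the node itself, and the name `relation_feature` are all assumptions made for illustration.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity; assumes neither vector is all zeros."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def relation_feature(emb, text_embs, image_embs):
    """Four-dimensional relationship feature vector of one node.

    Assumed component order:
      [text positive sum, text negative sum, image positive sum, image negative sum]
    Pass emb=None for a word or target that is not in the knowledge graph.
    """
    feature = np.zeros(4)
    if emb is None:
        return feature  # nodes outside the knowledge graph get the zero vector
    for offset, group in ((0, text_embs), (2, image_embs)):
        for other in group:
            if other is emb:  # do not compare a node with itself
                continue
            similarity = cosine(emb, other)
            feature[offset + (0 if similarity > 0 else 1)] += similarity
    return feature

# Toy embeddings (illustrative values, not taken from any real knowledge graph).
text_embs = [np.array([1.0, 0.0]), np.array([-1.0, 0.0])]
image_embs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
feature = relation_feature(text_embs[0], text_embs, image_embs)
# feature components: [0, -1, 1, 0]
```

Keeping positive and negative similarities in separate components preserves information that a single signed sum would cancel out.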
Those skilled in the art will appreciate that all or part of the flow of the methods of the embodiments described above may be implemented by a computer program instructing the associated hardware, and the program may be stored on a computer-readable storage medium, such as a magnetic disk, an optical disk, a read-only memory, or a random access memory.
The present application is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present application are intended to be included in the scope of the present application.

Claims (10)

1. A data analysis method based on data classification, comprising the steps of:
acquiring event data of a power system, wherein each piece of event data comprises an image and text content thereof;
extracting each segmented word and the text feature code from the text content using a text feature extractor; extracting each target and the image feature code from the image using an image feature extractor;
based on knowledge graph embedding, placing the segmented words and targets that exist in the knowledge graph into a text node set and an image node set respectively, and obtaining the embedding vector of each node; constructing a relationship feature vector for each node according to the similarities between the embedding vectors of the nodes;
obtaining a multi-modal feature vector for each piece of event data according to the relationship feature vectors, the text feature codes, and the image feature codes, and passing the multi-modal feature vector into a hierarchical recognition module to obtain the level of each piece of event data; and analyzing the association relationship between event data of the same level according to a CLIP model and a clustering algorithm.
2. The data analysis method based on data classification according to claim 1, wherein the relationship feature vector is a four-dimensional vector composed of four relationship values; the similarities are divided into four groups according to whether each similarity is positive or negative and according to the set to which the compared node belongs, and each relationship value is the sum of the similarities in one group.
3. The data analysis method based on data classification according to claim 2, wherein the method further comprises: setting the relationship feature vector of any segmented word or target that is not in the knowledge graph to a four-dimensional zero vector.
4. The data analysis method based on data classification according to claim 3, wherein obtaining the multi-modal feature vector of each piece of event data according to the relationship feature vectors, the text feature codes, and the image feature codes comprises:
concatenating the relationship feature vector of each node in the text node set to the head of the text feature code; concatenating the relationship feature vector of each node in the image node set to the head of the image feature code; passing the two concatenated feature vectors through two linear layers respectively to output feature representations of the same dimension; and combining the two feature representations of the same piece of event data to obtain the multi-modal feature vector.
5. The data analysis method based on data classification according to claim 4, wherein, after receiving the multi-modal feature vector of each piece of event data, the hierarchical recognition module obtains the level of each piece of event data through a linear projection layer followed by a softmax layer; the levels include core, important, and general.
6. The data analysis method based on data classification according to claim 1 or 5, wherein analyzing the association relationship between event data of the same level according to the CLIP model and the clustering algorithm comprises:
inputting the images and text content of event data of the same level into a pre-trained CLIP model to obtain the fused feature vector that is fed into the softmax layer;
and grouping the fused feature vectors with a clustering algorithm, and establishing a strong association relationship among event data that fall into the same cluster according to the clustering result.
7. The data analysis method based on data classification according to claim 1, wherein the text feature code and the image feature code have the same length and are obtained by passing the output of the last hidden layer of the text feature extractor and of the image feature extractor, respectively, through two linear layers, a regularization step, and an activation function in sequence.
8. The data analysis method based on data classification according to claim 1, wherein the knowledge graph embedding uses a TransE model to obtain the embedding vectors of entities in the Freebase knowledge graph dataset.
9. The data analysis method based on data classification according to claim 1, wherein the text feature extractor uses a pre-trained BERT model and the image feature extractor uses a pre-trained Vision Transformer model.
10. A data analysis system based on data classification, comprising:
an event data acquisition module, configured to acquire event data of the power system, wherein each piece of event data comprises an image and its text content;
an event feature extraction module, configured to extract each segmented word and the text feature code from the text content using a text feature extractor, and to extract each target and the image feature code from the image using an image feature extractor;
a joint relation extraction module, configured to place, based on knowledge graph embedding, the segmented words and targets that exist in the knowledge graph into a text node set and an image node set respectively, and to obtain the embedding vector of each node; and to construct a relationship feature vector for each node according to the similarities between the embedding vectors of the nodes;
a data hierarchical analysis module, configured to obtain a multi-modal feature vector for each piece of event data according to the relationship feature vectors, the text feature codes, and the image feature codes, and to pass the multi-modal feature vector into a hierarchical recognition module to obtain the level of each piece of event data; and to analyze the association relationship between event data of the same level according to the CLIP model and the clustering algorithm.
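The fusion and level recognition recited in claims 4 and 5 can be sketched as follows. This is a minimal illustration under several assumptions: each modality contributes one aggregated four-dimensional relationship vector, "combining" the two representations is taken to mean concatenation, bias terms are omitted, and the weights are untrained random placeholders. The class name `LevelRecognizer` and the hidden size are hypothetical.

```python
import numpy as np

LEVELS = ("core", "important", "general")

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

class LevelRecognizer:
    """Two per-modality linear layers, concatenation, then linear projection + softmax."""

    def __init__(self, text_dim, image_dim, hidden_dim=16, seed=0):
        rng = np.random.default_rng(seed)
        # Untrained placeholder weights; "+ 4" covers the prepended relationship vector.
        self.w_text = rng.normal(scale=0.1, size=(text_dim + 4, hidden_dim))
        self.w_image = rng.normal(scale=0.1, size=(image_dim + 4, hidden_dim))
        self.w_out = rng.normal(scale=0.1, size=(2 * hidden_dim, len(LEVELS)))

    def fuse(self, text_rel, text_code, image_rel, image_code):
        # Prepend each relationship vector to the head of its feature code,
        # then project both modalities to the same dimension.
        text_repr = np.concatenate([text_rel, text_code]) @ self.w_text
        image_repr = np.concatenate([image_rel, image_code]) @ self.w_image
        return np.concatenate([text_repr, image_repr])  # multi-modal feature vector

    def predict(self, text_rel, text_code, image_rel, image_code):
        fused = self.fuse(text_rel, text_code, image_rel, image_code)
        probs = softmax(fused @ self.w_out)
        return LEVELS[int(np.argmax(probs))], probs

model = LevelRecognizer(text_dim=8, image_dim=6)
level, probs = model.predict(np.zeros(4), np.ones(8), np.zeros(4), np.ones(6))
```

Because the weights are random, the predicted level here carries no meaning; the sketch only shows how the vector shapes fit together.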
CN202310830398.0A 2023-07-07 2023-07-07 Data analysis method and system based on data classification Pending CN117077071A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310830398.0A CN117077071A (en) 2023-07-07 2023-07-07 Data analysis method and system based on data classification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310830398.0A CN117077071A (en) 2023-07-07 2023-07-07 Data analysis method and system based on data classification

Publications (1)

Publication Number Publication Date
CN117077071A true CN117077071A (en) 2023-11-17

Family

ID=88706881

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310830398.0A Pending CN117077071A (en) 2023-07-07 2023-07-07 Data analysis method and system based on data classification

Country Status (1)

Country Link
CN (1) CN117077071A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117473431A (en) * 2023-12-22 2024-01-30 青岛民航凯亚***集成有限公司 Airport data classification and classification method and system based on knowledge graph

Similar Documents

Publication Publication Date Title
CN106649715B (en) A kind of cross-media retrieval method based on local sensitivity hash algorithm and neural network
CN109271539B (en) Image automatic labeling method and device based on deep learning
CN112818906A (en) Intelligent full-media news cataloging method based on multi-mode information fusion understanding
CN115329127A (en) Multi-mode short video tag recommendation method integrating emotional information
CN116049397B (en) Sensitive information discovery and automatic classification method based on multi-mode fusion
CN117077071A (en) Data analysis method and system based on data classification
CN116484024A (en) Multi-level knowledge base construction method based on knowledge graph
CN115017425B (en) Location search method, location search device, electronic device, and storage medium
CN115526236A (en) Text network graph classification method based on multi-modal comparative learning
CN113656561A (en) Entity word recognition method, apparatus, device, storage medium and program product
CN114254102B (en) Natural language-based collaborative emergency response SOAR script recommendation method
CN113742396A (en) Mining method and device for object learning behavior pattern
CN115797795B (en) Remote sensing image question-answer type retrieval system and method based on reinforcement learning
CN117009516A (en) Converter station fault strategy model training method, pushing method and device
WO2023168818A1 (en) Method and apparatus for determining similarity between video and text, electronic device, and storage medium
CN115953041A (en) Construction scheme and system of operator policy system
Feng et al. ModelsKG: A Design and Research on Knowledge Graph of Multimodal Curriculum Based on PaddleOCR and DeepKE
CN114842301A (en) Semi-supervised training method of image annotation model
CN114756679A (en) Chinese medical text entity relation combined extraction method based on conversation attention mechanism
CN113886602A (en) Multi-granularity cognition-based domain knowledge base entity identification method
Duro et al. Boosting the automated information processing for reconnaissance
CN114595693A (en) Text emotion analysis method based on deep learning
Das et al. Incorporating domain knowledge to improve topic segmentation of long MOOC lecture videos
Yu et al. Workflow recommendation based on graph embedding
Xiang et al. Document similarity detection based on multi-feature semantic fusion and concept graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination