CN114996360B

CN114996360B - Data analysis method, system, readable storage medium and computer equipment

Info

Publication number: CN114996360B
Application number: CN202210851818.9A
Authority: CN
Inventors: 章建群; 樊振军
Original assignee: Jiangxi Modern Polytechnic College
Current assignee: Jiangxi Modern Polytechnic College
Priority date: 2022-07-20
Filing date: 2022-07-20
Publication date: 2022-11-18
Anticipated expiration: 2042-07-20
Also published as: CN114996360A

Abstract

The invention provides a data analysis method, a data analysis system, a readable storage medium and computer equipment. The method comprises the following steps: determining a data extraction mode based on the data type of the data to be analyzed sent by the terminal equipment, and extracting each distinguishing data and basic data in the data to be analyzed by using the data extraction mode; combining basic data with the same basic indexes to obtain a plurality of basic data sets; capturing the data field blocks of each basic data set, extracting characteristic indexes in each distinguishing data, and calculating the frequency with which each characteristic index appears in each data field block; splitting the data to be analyzed into a plurality of subdata to be analyzed according to each frequency; and calculating the priority coefficient of each subdata to be analyzed, and analyzing each subdata to be analyzed in sequence according to the priority coefficient. Because the subdata to be analyzed are processed in sequence according to their priority coefficients, the invention avoids the reduced analysis rate and increased analysis time caused by executing the tasks synchronously.

Description

Data analysis method, system, readable storage medium and computer device

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a data analysis method, a data analysis system, a readable storage medium, and a computer device.

Background

With the rapid development of science and technology and the improvement of people's living standards, data of all kinds emerge endlessly, and as this data grows rapidly, the demand for data analysis grows with it.

When data is analyzed, computer equipment generally loads all of the data into various databases for comprehensive analysis and displays the results only after that analysis is complete. As the volume of the databases and of the data to be analyzed increases, however, every part of the data still has to be analyzed. Analyzing a large volume of data requires the analyst to have corresponding data development skills, and because the analysis occupies a large share of the computer equipment's running memory, the equipment may overheat, stall or even crash, which lengthens the analysis time and lowers working efficiency and analysis efficiency.

Disclosure of Invention

Based on this, the present invention aims to provide a data analysis method, a system, a readable storage medium and a computer device, so as to solve at least the deficiencies in the above technology.

The invention provides a data analysis method, which comprises the following steps:

receiving data to be analyzed sent by terminal equipment, determining a corresponding data extraction mode based on the data type of the data to be analyzed, and extracting each difference data and basic data in the data to be analyzed by using the data extraction mode;

combining the basic data with the same basic indexes in each basic data to obtain a plurality of basic data sets;

capturing data field blocks corresponding to the basic data sets, extracting characteristic indexes in the distinguishing data, and calculating the frequency of the characteristic indexes in the data field blocks by using a hash table algorithm;

dividing the data to be analyzed into a plurality of subdata to be analyzed according to the frequency of each characteristic index appearing in each data field block;

and calculating the priority coefficient of each subdata to be analyzed, and sequentially performing data analysis on each subdata to be analyzed according to the priority coefficient of each subdata to be analyzed.

Further, before the step of determining the corresponding data extraction manner based on the data type of the data to be analyzed, the method further includes:

extracting a character string sequence with a unique identifier in the data to be analyzed;

and inputting the character string sequence into a preset character string sequence table for data comparison, and determining the data type of the data to be analyzed according to a data comparison result.

Further, the step of determining a corresponding data extraction manner based on the data type of the data to be analyzed, and extracting each difference data and the basic data in the data to be analyzed by using the data extraction manner includes:

when the data type of the data to be analyzed is text data, preprocessing the data to be analyzed;

representing the preprocessed data to be analyzed as a numerical vector by using a bag-of-words model to obtain a numerical characteristic matrix of the data to be analyzed;

and obtaining each difference data and basic data in the data to be analyzed according to the numerical characteristic matrix.

when the data type of the data to be analyzed is image data, smoothing the data to be analyzed to obtain first processed data;

calculating the gradient amplitude and the gradient direction of the first processing data by using a finite difference method, and performing non-maximum suppression processing on the gradient amplitude to obtain second processing data;

and extracting pixel points and edges of the second processed data to obtain each distinguishing data and basic data of the second processed data.

Further, the step of combining the basic data with the same basic index in each basic data to obtain a plurality of basic data sets includes:

extracting key data of each basic data by using a key database to obtain the key data of each basic data;

performing similar characteristic comparison on the key data of each basic data by using a standard database to obtain similar characteristic measurement of each basic data;

and clustering the basic data according to the similar characteristic metric of the basic data to obtain a plurality of basic data sets.

Further, the step of calculating the priority coefficient of each sub-data to be analyzed includes:

acquiring the running memory space of a processor, and calculating, based on the running memory space of the processor, the amount of that running memory space occupied by each subdata to be analyzed;

and calculating the priority coefficient of each subdata to be analyzed according to the amount of running memory space occupied by each subdata to be analyzed.

The invention also provides a data analysis system, comprising:

the data extraction module is used for receiving data to be analyzed sent by terminal equipment, determining a corresponding data extraction mode based on the data type of the data to be analyzed, and extracting each difference data and basic data in the data to be analyzed by using the data extraction mode;

the data combination module is used for combining basic data with the same basic indexes in each basic data to obtain a plurality of basic data sets;

the data calculation module is used for capturing the data field blocks corresponding to the basic data sets, extracting the characteristic indexes in the distinguishing data, and calculating the frequency of the characteristic indexes in the data field blocks by using a hash table algorithm;

the data splitting module is used for splitting the data to be analyzed into a plurality of subdata to be analyzed according to the frequency of each characteristic index appearing in each data field block;

and the data analysis module is used for calculating the priority coefficient of each subdata to be analyzed and sequentially carrying out data analysis on each subdata to be analyzed according to the priority coefficient of each subdata to be analyzed.

Further, the system further comprises:

the character string extraction module is used for extracting a character string sequence with a unique identifier in the data to be analyzed;

and the data comparison module is used for inputting the character string sequence into a preset character string sequence table for data comparison and determining the data type of the data to be analyzed according to the data comparison result.

Further, the data extraction module comprises:

the preprocessing unit is used for preprocessing the data to be analyzed when the data type of the data to be analyzed is text data;

the data representation unit is used for representing the preprocessed data to be analyzed into a numerical vector by using a bag-of-words model so as to obtain a numerical characteristic matrix of the data to be analyzed;

and the data extraction unit is used for obtaining each difference data and basic data in the data to be analyzed according to the numerical characteristic matrix.

Further, the data extraction module comprises:

the smoothing unit is used for smoothing the data to be analyzed to obtain first processed data when the data type of the data to be analyzed is image data;

the data calculation unit is used for calculating the gradient amplitude and the gradient direction of the first processing data by using a finite difference method and carrying out non-maximum suppression processing on the gradient amplitude to obtain second processing data;

and the data processing unit is used for extracting pixel points and edges of the second processed data to obtain each difference data and basic data of the second processed data.

Further, the data combination module includes:

the key data extraction unit is used for extracting key data of each basic data by using a key database to obtain the key data of each basic data;

the similar characteristic comparison unit is used for performing similar characteristic comparison on the key data of each basic data by using a standard database to obtain similar characteristic measurement of each basic data;

and the data clustering unit is used for clustering the basic data according to the similarity characteristic measurement of the basic data to obtain a plurality of basic data sets.

Further, the data analysis module comprises:

the running memory space obtaining unit is used for obtaining the running memory space of a processor and calculating, based on the running memory space of the processor, the amount of that running memory space occupied by each subdata to be analyzed;

and the priority calculating unit is used for calculating the priority coefficient of each subdata to be analyzed according to the amount of the processor's running memory space occupied by each subdata to be analyzed.

The invention also proposes a readable storage medium on which a computer program is stored which, when executed by a processor, implements the data analysis method described above.

The invention also provides a computer device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor implements the data analysis method when executing the computer program.

In the data analysis method, system, readable storage medium and computer equipment described above, the data to be analyzed is divided into a plurality of distinguishing data and basic data, the basic data with the same basic indexes are combined, and the characteristic indexes of the distinguishing data are used to calculate frequencies in the data field blocks corresponding to the basic data sets, so that the data to be analyzed is split into a plurality of subdata. This reduces the amount of data processed at one time and thus avoids the degraded analysis results and reduced analysis rate caused by an excessive data volume. The priority coefficient of each subdata to be analyzed is also calculated, so that the subdata can be processed in sequence according to their priority coefficients, which avoids the reduced analysis rate and increased analysis time caused by executing the tasks synchronously.

Drawings

FIG. 1 is a flow chart of a data analysis method in a first embodiment of the present invention;

FIG. 2 is a detailed flowchart of step S101 in FIG. 1;

FIG. 3 is a detailed flowchart of another embodiment of step S101 in FIG. 1;

FIG. 4 is a detailed flowchart of step S102 in FIG. 1;

FIG. 5 is a detailed flowchart of step S105 in FIG. 1;

FIG. 6 is a flow chart of a data analysis method in another embodiment of the present invention;

FIG. 7 is a block diagram showing the construction of a data analysis system according to a second embodiment of the present invention;

FIG. 8 is a block diagram showing a configuration of a computer device in a third embodiment of the present invention.

Description of the main element symbols:

The following detailed description will further illustrate the invention in conjunction with the above-described figures.

Detailed Description

To facilitate an understanding of the invention, the invention will now be described more fully hereinafter with reference to the accompanying drawings. Several embodiments of the invention are presented in the drawings. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.

It will be understood that when an element is referred to as being "secured to" another element, it can be directly on the other element or intervening elements may also be present. When an element is referred to as being "connected" to another element, it can be directly connected to the other element or intervening elements may also be present. The terms "vertical," "horizontal," "left," "right," and the like are used herein for purposes of illustration only.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.

Example one

Referring to FIG. 1, a data analysis method according to a first embodiment of the present invention is shown, and the method specifically includes steps S101 to S105:

S101, receiving data to be analyzed sent by terminal equipment, determining a corresponding data extraction mode based on the data type of the data to be analyzed, and extracting each difference data and basic data in the data to be analyzed by using the data extraction mode;

Specifically, referring to FIG. 2, the step S101 specifically includes steps S1011 to S1013:

S1011, when the data type of the data to be analyzed is text data, preprocessing the data to be analyzed;

S1012, representing the preprocessed data to be analyzed as numerical vectors by using a bag-of-words model to obtain a numerical characteristic matrix of the data to be analyzed;

and S1013, obtaining each difference data and basic data in the data to be analyzed according to the numerical characteristic matrix.

It should be noted that the terminal device is a device with a communication function, such as a mobile phone, a tablet computer or a notebook computer. Users interact with one another, or with the server, through the terminal device, and in either case data to be analyzed is sent. The data to be analyzed may have several data types, generally a text type and/or an image type; in some other embodiments, the data type may also be an audio type, a video type and the like. For example, when a user analyzes certain data through a search website such as Baidu, data content, which may be text, pictures, music, video and the like, is entered in the corresponding text box, and when the data content is uploaded to the server it is marked accordingly as a text type, an image type, an audio type, a video type and the like.

In this embodiment, when the data to be analyzed is sent from the terminal device, the terminal device generates a code sequence carrying a data mark, and the code sequence is identified against a code library to determine whether the data type of the data to be analyzed is a text type, an image type, or both a text type and an image type.

When the data type of the data to be analyzed is text data, the data to be analyzed is plain text information. Because plain text information contains many unnecessary contents after being edited by the terminal device, the data to be analyzed must be preprocessed first. In this embodiment, unnecessary data tags in the data to be analyzed are identified first; these data tags have no value for data analysis and are therefore deleted, so that no analysis effort is spent on them during data analysis.

It should be noted that text information usually contains accented characters/letters, and in this embodiment the accented characters/letters in the data to be analyzed are converted and normalized into ASCII characters. For example, when the data to be analyzed is "weather" and the accented character/letter "im" exists in its semantic description, the "im" is converted into "i" so that the server can recognize it more accurately.

Second, text information may contain abbreviations; in English text, for example, an abbreviation is usually a shortened form of a word or syllable. In this embodiment, all vocabulary in the data to be analyzed is checked for abbreviations, and each abbreviation found is converted into its expanded original form by using a pre-stored abbreviation library, which standardizes the text and improves recognition accuracy.

To prevent special characters from affecting text recognition, in this embodiment special characters are deleted from the vocabulary in the data to be analyzed by using a special character library, which is implemented with regular expressions (regexes).

In some optional embodiments, the data to be analyzed may be preprocessed by spelling correction, syntax error correction, or the like, so as to improve the recognition accuracy of the text information and the consistency of the data.
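The text preprocessing steps described above (tag removal, accent normalization, abbreviation expansion and special character deletion) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the tag pattern, the contents of `ABBREVIATIONS` and all function names are assumptions introduced here, and the ASCII normalization shown would simply drop characters that have no ASCII base form.

```python
import re
import unicodedata

# Hypothetical stand-in for the "pre-stored abbreviation library".
ABBREVIATIONS = {"don't": "do not", "it's": "it is"}


def strip_accents(text):
    """Decompose accented letters and keep only their ASCII base characters."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def expand_abbreviations(text):
    """Replace each known abbreviation with its expanded form."""
    return " ".join(ABBREVIATIONS.get(word.lower(), word) for word in text.split())


def remove_special_characters(text):
    """Delete everything except letters, digits and whitespace."""
    return re.sub(r"[^A-Za-z0-9\s]", "", text)


def preprocess(text):
    # Assumption: the "unnecessary data tags" look like markup tags.
    text = re.sub(r"<[^>]+>", " ", text)
    text = strip_accents(text)
    text = expand_abbreviations(text)
    text = remove_special_characters(text)
    return " ".join(text.split())  # collapse repeated whitespace
```

Abbreviations are expanded before special characters are removed because the apostrophe that identifies an abbreviation would otherwise be deleted first.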

Further, in this embodiment, the vocabulary of the big data is combined to construct a big-data bag-of-words model, and the bag-of-words model is used to represent the preprocessed data to be analyzed as vectors. Each piece of data to be analyzed is represented as a numerical vector in which every dimension corresponds to a specific word in the bag-of-words model, and the value in that dimension is the weight of the word or its frequency value (represented by 1 or 0) in the data to be analyzed. The numerical vectors are then assembled into a numerical characteristic matrix according to the weight values or frequency values.

In the numerical characteristic matrix, the weight value or frequency value of a word gives its word frequency in the data to be analyzed, and this word frequency, combined with the character recognition performed on the data to be analyzed by the text database, determines whether the word is distinguishing data or basic data. In this embodiment, a word whose word frequency is greater than the basic data threshold (50% in this embodiment) is basic data, and a word whose word frequency is less than the distinguishing data threshold (10% in this embodiment) is distinguishing data.
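As a rough illustration of the bag-of-words representation and the two thresholds, the sketch below builds a 1/0 feature matrix and classifies each word by the share of rows in which it appears. Reading "word frequency" as this share is an assumption made here, as are the function names and the whitespace tokenization; the 50% and 10% defaults are the values given in this embodiment.

```python
def build_vocabulary(documents):
    """Collect every distinct word; each word becomes one matrix column."""
    return sorted({word for doc in documents for word in doc.split()})


def binary_feature_matrix(documents, vocabulary):
    """One row per document, 1 if the word occurs in it and 0 otherwise."""
    rows = []
    for doc in documents:
        present = set(doc.split())
        rows.append([1 if word in present else 0 for word in vocabulary])
    return rows


def classify_words(documents, basic_threshold=0.5, distinguishing_threshold=0.1):
    """Split the vocabulary into basic and distinguishing words."""
    vocabulary = build_vocabulary(documents)
    matrix = binary_feature_matrix(documents, vocabulary)
    basic, distinguishing = [], []
    for col, word in enumerate(vocabulary):
        share = sum(row[col] for row in matrix) / len(documents)
        if share > basic_threshold:
            basic.append(word)
        elif share < distinguishing_threshold:
            distinguishing.append(word)
    return basic, distinguishing
```

Words whose share falls between the two thresholds are left unclassified in this sketch, since the embodiment does not say how they are handled.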

Further, referring to FIG. 3, in another optional embodiment, the step S101 specifically includes steps S1111 to S1113:

S1111, when the data type of the data to be analyzed is image data, smoothing the data to be analyzed to obtain first processed data;

S1112, calculating a gradient amplitude and a gradient direction of the first processed data by using a finite difference method, and performing non-maximum suppression processing on the gradient amplitude to obtain second processed data;

and S1113, performing pixel point and edge extraction on the second processed data to obtain each difference data and basic data of the second processed data.

In a specific implementation, when the data to be analyzed is image data, the data to be analyzed is image information that includes the color features and texture features of the image. In this embodiment, the image is first converted to grayscale, and the grayscale image is then smoothed with a Gaussian filtering algorithm, that is, a weighted average is taken over the grayscale image so that the value of each pixel is obtained as the weighted average of that pixel and the other pixel values in its neighborhood, which eliminates Gaussian noise in the image. Specifically, each pixel of the grayscale image is scanned by using a trained convolutional neural network model, and the value of the central pixel of the mask is replaced with the weighted average gray value of the pixels in the neighborhood determined by the mask information of the grayscale image, so as to obtain the first processed data (i.e., the smoothed image).

Further, the gradient amplitude and the gradient direction of the first processed data are calculated by a finite difference method, where the gradient is the two-dimensional equivalent of the finite difference and is defined as a vector. The gradient amplitude is then processed by a non-maximum suppression algorithm to obtain the second processed data (i.e., the picture processed by the non-maximum suppression algorithm).

The edges of the second processed data are detected and connected by using a double-threshold algorithm, and the edges and pixel points of the second processed data are extracted; a set of pixel points in the second processed data that can be fitted into complete data is taken as distinguishing data, and the edge data is taken as basic data. For example, when a picture containing an animal and a building is analyzed, the processed picture contains the animal contour, the building contour, and the areas outside the animal contour and the building contour. The texture information of the animal contour and the building contour is obtained by edge detection; this texture information is the distinguishing data, and the information of the remaining areas is the basic data.
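The gradient, non-maximum suppression and double-threshold stages described above resemble the classic Canny edge detection pipeline. The sketch below shows one simple way these stages could look on a grayscale image stored as a list of rows; it assumes the smoothing step has already been applied, uses central differences as the finite difference scheme, and simplifies edge linking to a single pass. All function names and the threshold values in the usage note are illustrative assumptions.

```python
import math


def gradient(image):
    """Central finite differences on interior pixels; borders stay at zero."""
    h, w = len(image), len(image[0])
    magnitude = [[0.0] * w for _ in range(h)]
    direction = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (image[y][x + 1] - image[y][x - 1]) / 2.0
            gy = (image[y + 1][x] - image[y - 1][x]) / 2.0
            magnitude[y][x] = math.hypot(gx, gy)
            direction[y][x] = math.atan2(gy, gx)
    return magnitude, direction


def non_maximum_suppression(magnitude, direction):
    """Keep a pixel only if it is not smaller than both neighbors along the gradient."""
    h, w = len(magnitude), len(magnitude[0])
    thin = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            angle = math.degrees(direction[y][x]) % 180.0
            if angle < 22.5 or angle >= 157.5:
                dy, dx = 0, 1      # gradient roughly horizontal
            elif angle < 67.5:
                dy, dx = 1, 1      # diagonal (down-right)
            elif angle < 112.5:
                dy, dx = 1, 0      # gradient roughly vertical
            else:
                dy, dx = 1, -1     # diagonal (down-left)
            m = magnitude[y][x]
            if m >= magnitude[y + dy][x + dx] and m >= magnitude[y - dy][x - dx]:
                thin[y][x] = m
    return thin


def double_threshold(thin, low, high):
    """Strong pixels are edges; weak pixels count only if they touch a strong one."""
    h, w = len(thin), len(thin[0])
    strong = {(y, x) for y in range(h) for x in range(w) if thin[y][x] >= high}
    edges = set(strong)
    for y in range(h):
        for x in range(w):
            if low <= thin[y][x] < high and any(
                (y + j, x + i) in strong for j in (-1, 0, 1) for i in (-1, 0, 1)
            ):
                edges.add((y, x))
    return edges
```

For a small image with a vertical step from 0 to 100, `double_threshold(non_maximum_suppression(*gradient(image)), low=10, high=40)` returns the pixel coordinates on either side of the step. A full implementation would follow chains of weak pixels rather than checking direct neighbors only.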

S102, combining basic data with the same basic indexes in each basic data to obtain a plurality of basic data sets;

Further, referring to FIG. 4, the step S102 specifically includes steps S1021 to S1023:

S1021, extracting key data of each basic data by using a key database to obtain the key data of each basic data;

S1022, performing similar feature comparison on the key data of each basic data by using a standard database to obtain similar feature measurement of each basic data;

and S1023, clustering the basic data according to the similar feature metric of the basic data to obtain a plurality of basic data sets.

It should be noted that, in this embodiment, the data tree of each basic data is captured through a data tree capture rule. The data tree consists of all components of the corresponding basic data, including a data classification head and a data trunk. The data classification head of each basic data is extracted by using the key data extraction rule of the key database, and the extracted result is used to search the standard database for similar data classification heads, so as to calculate the similarity feature metric of each basic data.

It can be understood that the standard database contains the data classification heads of all data. The data classification head of each basic data is compared for similarity with all data classification heads in the standard database, the similarity feature metric of the data classification head of each basic data is calculated, and basic data whose data classification heads have the same similarity feature metric are clustered as the same or similar data, so as to obtain a plurality of basic data sets.

It should be understood that, in other alternative embodiments, the key data can also be stored in the key database in advance, and when the basic data is obtained, the key data corresponding to the basic data is directly extracted from the key database.
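One possible reading of the clustering in step S102 is sketched below: each basic data item is assigned to the standard classification head it most resembles, and items sharing that head form one basic data set. The patent does not define the similarity feature metric, so the use of `difflib.SequenceMatcher`, the contents of `STANDARD_HEADS` and the `(head, trunk)` data layout are all assumptions made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical contents of the "standard database" of data classification heads.
STANDARD_HEADS = ["vehicle/sales", "vehicle/risk", "image/texture"]


def closest_standard_head(head):
    """Return the standard head with the highest string similarity to `head`."""
    return max(STANDARD_HEADS, key=lambda s: SequenceMatcher(None, head, s).ratio())


def cluster_basic_data(items):
    """Group (classification_head, trunk) pairs by their closest standard head."""
    clusters = defaultdict(list)
    for head, trunk in items:
        clusters[closest_standard_head(head)].append(trunk)
    return dict(clusters)
```

Each value in the returned dictionary plays the role of one basic data set, holding the data trunks that are later searched for data field blocks.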

S103, capturing data field blocks corresponding to the basic data sets, extracting characteristic indexes in the distinguishing data, and calculating the frequency of the characteristic indexes in the data field blocks by using a hash table algorithm;

It should be noted that the basic data set includes the data trees of the basic data with the same similarity feature metric, and the data field blocks of the basic data exist in the data trunk of the data tree. When the basic data set is acquired, the data trunks of all the basic data in the basic data set are parsed into a plurality of data characters $X_1$, $X_2$, $X_3$. The data characters $X_1$, $X_2$, $X_3$ are then captured in a database containing the data information of each equipment terminal, so as to obtain all data field blocks containing the data characters $X_1$, $X_2$, $X_3$. All the data field blocks are sorted according to the number of data characters $X_1$, $X_2$, $X_3$ they contain, so as to obtain the sorted data field blocks $Y_1, Y_2, Y_3, \ldots, Y_n$.

Further, the characteristic indexes with analytical significance are extracted from the obtained distinguishing data. When the data to be analyzed is text data, several specific words with analytical significance in the distinguishing data are taken as the characteristic indexes; when the data to be analyzed is image data, the texture information with analytical significance in the distinguishing data is taken as the characteristic index. Here, "with analytical significance" means the following: when a word in the distinguishing data can represent other words in the distinguishing data, the represented words are removed and the remaining words are used as characteristic indexes; when a piece of texture information in the distinguishing data is texture information common to many kinds of image data, that texture information is removed and the remaining texture information is used as the characteristic index.

By way of example and not limitation, when the data to be analyzed is "annual sales volume and corresponding risk rating of each vehicle model of a certain brand of vehicle", the distinguishing data obtained by processing it are "a certain brand", "each vehicle model", "annual sales volume" and "risk rating". However, "each vehicle model" already implies the brand name of its corresponding "certain brand", so "a certain brand" is removed, and "each vehicle model", "annual sales volume" and "risk rating" are used as characteristic indexes.

When the data to be analyzed is a picture, the texture information obtained by processing the picture includes "windmill", "grassland", "cloud" and the like. In picture data, however, the texture information of "grassland" and "cloud" is common texture information, so it is removed, and the "windmill" texture information is used as the characteristic index.

Specifically, the HashMap algorithm is used to calculate the frequency $V_{(i)}$ with which each obtained characteristic index appears in the data field blocks $Y_1, Y_2, Y_3, \ldots, Y_n$:

$$V_{a1},\ V_{a2},\ V_{a3},\ \ldots,\ V_{an};$$

$$V_{b1},\ V_{b2},\ V_{b3},\ \ldots,\ V_{bn};$$

$$\vdots$$

$$V_{z1},\ V_{z2},\ V_{z3},\ \ldots,\ V_{zn};$$

In the formula, $i$ denotes each characteristic index, $i = a, b, c, \ldots, z$.
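The frequency table above can be illustrated with a hash-map based count. The sketch below uses Python's `collections.Counter` (a dictionary subclass) in place of a Java-style HashMap; it assumes each data field block is a whitespace-separated string and each characteristic index is a single token, neither of which is stated in the patent.

```python
from collections import Counter


def index_frequencies(feature_indexes, field_blocks):
    """Return {index: [V_i1, ..., V_in]}, one count per data field block."""
    # Tokenize and count each block once; lookups are then O(1) per index.
    block_counts = [Counter(block.split()) for block in field_blocks]
    return {
        index: [counts[index] for counts in block_counts]
        for index in feature_indexes
    }
```

For example, `index_frequencies(["sales", "risk"], ["sales risk sales", "risk model"])` yields two counts per index, corresponding to the rows $V_{a1}, V_{a2}$ and $V_{b1}, V_{b2}$ of the table.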

S104, dividing the data to be analyzed into a plurality of subdata to be analyzed according to the frequency of each characteristic index appearing in each data field block;

In a specific implementation, the obtained frequencies $V_{(i)}$ that are greater than the frequency threshold (70 in this embodiment; the threshold may be set by the user or be a preset value) are sorted, and the data to be analyzed is split into a plurality of subdata to be analyzed according to the sorting result.
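The patent does not spell out how the sorted frequencies map to subdata, so the following is only one plausible interpretation: for each characteristic index, the data field blocks whose frequency exceeds the threshold are kept and ordered by frequency, and each index with at least one surviving block yields one subdata. The dictionary layout and the `peak` field are hypothetical.

```python
def split_by_frequency(frequencies, threshold=70):
    """frequencies: {feature_index: [V_i1, ..., V_in]} -> ordered list of subdata."""
    sub_data = []
    for index, values in frequencies.items():
        # Keep (frequency, block number) pairs strictly above the threshold.
        kept = [(v, block_no) for block_no, v in enumerate(values, start=1) if v > threshold]
        if kept:
            kept.sort(reverse=True)
            sub_data.append({
                "index": index,
                "blocks": [block_no for _, block_no in kept],
                "peak": kept[0][0],
            })
    # Order the subdata themselves by their highest frequency.
    sub_data.sort(key=lambda s: s["peak"], reverse=True)
    return sub_data
```

Block numbers are 1-based so that block `k` corresponds to $Y_k$ in the description.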

And S105, calculating the priority coefficient of each subdata to be analyzed, and sequentially analyzing the data of each subdata to be analyzed according to the priority coefficient of each subdata to be analyzed.

Further, referring to FIG. 5, the step S105 specifically includes steps S1051 to S1052:

S1051, obtaining the running memory space of a processor, and calculating, based on the running memory space of the processor, the amount of that running memory space occupied by each subdata to be analyzed;

and S1052, calculating the priority coefficient of each subdata to be analyzed according to the amount of the processor's running memory space occupied by each subdata to be analyzed.

In a specific implementation, the running memory space of the corresponding processor is obtained (in this embodiment the processor is a cloud server; in other embodiments it may be a terminal device with a data processing function), the amount of running memory space occupied by each subdata to be analyzed is calculated based on that running memory space, and the priority coefficient of each subdata to be analyzed is calculated from the amount of running memory space it occupies. In this embodiment, a mapping table between the amount of running memory space occupied by subdata to be analyzed and the priority coefficient is constructed in advance; once the amount of running memory space occupied by each subdata to be analyzed has been calculated, the corresponding priority coefficient is obtained from the mapping table. An example of the mapping table is as follows (taking a running memory space of 100 MB as an example):

By way of example and not limitation, assume that the running memory space of the processor is 100 MB and that the subdata to be analyzed (A, B, C, D) require 20 MB, 50 MB, 80 MB and 100 MB of it, respectively. Comparing these values with the running memory space gives occupancy percentages of 20%, 50%, 80% and 100% for the subdata to be analyzed (A, B, C, D), and the corresponding priority coefficients are 1.5, 1, 0.8 and 0.5, respectively. Thus, the larger the running memory space occupied by a subdata to be analyzed, the smaller its priority coefficient.

The priority coefficients of the sub-data to be analyzed are sorted from largest to smallest, and data analysis is performed on the sub-data to be analyzed one after another in the sorted order.
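The lookup and ordering described in steps S1051 to S1052 can be illustrated with a minimal Python sketch. It is not the patented implementation: the four percentage and coefficient pairs are taken from the example in this embodiment, while treating them as upper bounds of occupancy ranges, and the names `OCCUPANCY_BOUNDS`, `priority_coefficient` and `analysis_order`, are assumptions made only for illustration.

```python
from bisect import bisect_left

# Assumed mapping table: upper bound of occupancy (percent) -> priority coefficient.
# The pairs come from the 100 MB example; the bucketing rule is an assumption.
OCCUPANCY_BOUNDS = [20, 50, 80, 100]
PRIORITY_COEFFICIENTS = [1.5, 1.0, 0.8, 0.5]


def priority_coefficient(required_mb, total_mb):
    """Look up the priority coefficient of one sub-data to be analyzed."""
    percent = required_mb * 100 / total_mb
    if percent > 100:
        raise ValueError("sub-data does not fit in the running memory space")
    # Index of the first bucket whose upper bound is >= the occupancy percentage.
    index = bisect_left(OCCUPANCY_BOUNDS, percent)
    return PRIORITY_COEFFICIENTS[index]


def analysis_order(required_mb_by_name, total_mb):
    """Return sub-data names sorted from largest to smallest priority coefficient."""
    return sorted(
        required_mb_by_name,
        key=lambda name: priority_coefficient(required_mb_by_name[name], total_mb),
        reverse=True,
    )


sub_data = {"D": 100, "A": 20, "C": 80, "B": 50}  # required memory in MB
coefficients = {name: priority_coefficient(mb, 100) for name, mb in sub_data.items()}
order = analysis_order(sub_data, 100)
print(coefficients)
print(order)
```

With these assumed buckets, the sub-data occupying the least running memory is analyzed first, which matches the rule that a larger occupied space gives a smaller priority coefficient.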

It can be understood that analyzing the data one by one in order of priority coefficient avoids the overheating, stalling and even crashing of the processor that excessively large data would cause, which would otherwise lengthen the analysis time and reduce working efficiency and analysis efficiency.

As another embodiment of the present invention, referring to fig. 6, before step S101, the method further includes steps S201 to S202:

S201, extracting a character string sequence with a unique identifier from the data to be analyzed;

S202, inputting the character string sequence into a preset character string sequence table for data comparison, and determining the data type of the data to be analyzed according to the data comparison result.

It should be noted that, in this embodiment, each piece of data to be analyzed is converted into corresponding encoded data by a specific preprocessing method when it is sent from the terminal device, and the encoded data may contain a character string sequence carrying the unique identifier of that data. This character string sequence is extracted and compared with a preset character string sequence table. It can be understood that the character string sequence table lists the character string sequences of the various data types, including a text type, an image type, an audio type, a video type, and the like. When the character string sequence carrying the unique identifier matches a sequence in the preset character string sequence table, the data type of the data to be analyzed can be confirmed.
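As a rough illustration of steps S201 to S202, the comparison against the preset table can be sketched in Python as a dictionary lookup. The identifier values (`"TXT"`, `"IMG"`, and so on), the fixed identifier length and the assumption that the identifier sits at the start of the encoded data are all hypothetical; the embodiment does not specify the encoding.

```python
from typing import Optional

# Hypothetical preset string sequence table: identifier sequence -> data type.
# The identifier values are invented for illustration only.
STRING_SEQUENCE_TABLE = {
    "TXT": "text",
    "IMG": "image",
    "AUD": "audio",
    "VID": "video",
}
IDENTIFIER_LENGTH = 3  # assumed: fixed-length identifier at the start of the encoded data


def extract_identifier(encoded_data: str) -> str:
    """S201: extract the string sequence carrying the unique identifier."""
    return encoded_data[:IDENTIFIER_LENGTH]


def determine_data_type(encoded_data: str) -> Optional[str]:
    """S202: compare the sequence with the preset table; None if nothing matches."""
    return STRING_SEQUENCE_TABLE.get(extract_identifier(encoded_data))


print(determine_data_type("IMG0001payload"))
print(determine_data_type("???0001payload"))
```

Returning `None` for an unmatched sequence is a design choice of this sketch; the text only states what happens when a match is found.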

In summary, in the data analysis method of the above embodiments of the present invention, the data to be analyzed is divided into a plurality of difference data and basic data, and the basic data with the same basic index are combined. The characteristic indexes of the difference data are then used to calculate frequencies in the data field blocks corresponding to the basic data sets, so that the data to be analyzed is split into a plurality of sub-data to be analyzed. This reduces the amount of data processed at one time and avoids the degraded analysis results and reduced analysis rate caused by an excessive data amount. The priority coefficient of each sub-data to be analyzed is also calculated, so that the sub-data to be analyzed can be processed sequentially according to their priority coefficients, which avoids the reduced analysis rate and increased analysis time caused by executing the tasks synchronously.

Example two

In another aspect, referring to fig. 7, a data analysis system according to a second embodiment of the present invention is further provided, where the data analysis system includes:

the data extraction module 11 is configured to receive data to be analyzed sent by a terminal device, determine a corresponding data extraction manner based on a data type of the data to be analyzed, and extract each difference data and basic data in the data to be analyzed by using the data extraction manner;

further, the data extraction module 11 includes:

In other embodiments, the data extraction module 11 includes:

The data combination module 12 is configured to combine basic data with the same basic index in each basic data to obtain multiple basic data sets;

further, the data combination module 12 includes:

A data calculation module 13, configured to capture data field blocks corresponding to each basic data set, extract feature indicators in each difference data, and calculate, by using a hash table algorithm, the frequency of occurrence of each feature indicator in each data field block;

a data splitting module 14, configured to split the data to be analyzed into multiple sub-data to be analyzed according to the frequency of occurrence of each characteristic indicator in each data field block;

and the data analysis module 15 is configured to calculate a priority coefficient of each to-be-analyzed subdata, and perform data analysis on each to-be-analyzed subdata in sequence according to the priority coefficient of each to-be-analyzed subdata.

Further, the data analysis module 15 includes:

a running space obtaining unit, configured to obtain the running memory space of a processor and to calculate, based on the running memory space of the processor, the amount of running memory space of the processor occupied by each sub-data to be analyzed;

and a priority calculating unit, configured to calculate the priority coefficient of each sub-data to be analyzed according to the amount of running memory space occupied by each sub-data to be analyzed.

In other embodiments, the system further comprises:

The functions or operation steps of the modules and units when executed are substantially the same as those of the method embodiments, and are not described herein again.

The data analysis system provided by the embodiment of the present invention has the same implementation principle and technical effect as the foregoing method embodiments. For brevity, where something is not mentioned in the system embodiment, reference may be made to the corresponding content in the foregoing method embodiments.
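The frequency calculation performed by the data calculation module 13 with a hash table algorithm can be sketched as follows. This is only an illustrative sketch: it assumes that each data field block has already been tokenized into a list of strings and that a characteristic indicator is a single token, neither of which is stated in the embodiment. The function name `indicator_frequencies` is hypothetical. Python's `collections.Counter` is a dictionary subclass, so it serves as the hash table here.

```python
from collections import Counter


def indicator_frequencies(field_blocks, indicators):
    """Count how often each characteristic indicator occurs in each data field block."""
    wanted = set(indicators)
    result = {}
    for block_name, tokens in field_blocks.items():
        # Counter is a hash table keyed by token; only wanted tokens are counted.
        counts = Counter(token for token in tokens if token in wanted)
        # A missing key in a Counter reads as 0, so absent indicators get frequency 0.
        result[block_name] = {indicator: counts[indicator] for indicator in indicators}
    return result


# Hypothetical, already tokenized data field blocks.
field_blocks = {
    "block_1": ["temp", "speed", "temp", "load"],
    "block_2": ["load", "load", "speed"],
}
frequencies = indicator_frequencies(field_blocks, ["temp", "load"])
print(frequencies)
```

Each block is scanned once and each lookup takes constant time on average, which is the usual reason for choosing a hash table for this kind of counting.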

Example three

Referring to fig. 8, a computer device according to a third embodiment of the present invention is shown, which includes a memory 10, a processor 20, and a computer program 30 stored in the memory 10 and executable on the processor 20, where the processor 20 implements the data analysis method when executing the computer program 30.

The memory 10 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, and the like. The memory 10 may in some embodiments be an internal storage unit of the computer device, for example a hard disk of the computer device. The memory 10 may also be an external storage device in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like. Further, the memory 10 may also include both an internal storage unit and an external storage device of the computer apparatus. The memory 10 may be used not only to store application software installed in the computer device and various kinds of data, but also to temporarily store data that has been output or will be output.

In some embodiments, the processor 20 may be an Electronic Control Unit (ECU), a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor or other data Processing chip, and is configured to run program codes stored in the memory 10 or process data, such as executing an access restriction program.

It should be noted that the configuration shown in fig. 8 does not constitute a limitation of the computer device, and in other embodiments the computer device may include fewer or more components than those shown, or some components may be combined, or a different arrangement of components.

An embodiment of the present invention further provides a readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the data analysis method as described above.

Those of skill in the art will understand that the logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be viewed as implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.

More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.

It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

For brevity, not all possible combinations of the technical features of the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to be within the scope of the present disclosure.

The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is specific and detailed, but not to be understood as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, and these are all within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A method of data analysis, comprising:

receiving data to be analyzed sent by terminal equipment, determining a corresponding data extraction mode based on the data type of the data to be analyzed, and extracting each difference data and basic data in the data to be analyzed by using the data extraction mode, wherein, when the data type of the data to be analyzed is text data, the basic data is a word whose word frequency in the data to be analyzed, as given by the weighted value or frequency value of that word in a numerical characteristic matrix of the data to be analyzed, is greater than a basic data threshold value, and the difference data is a word whose word frequency in the data to be analyzed, as given by the weighted value or frequency value of that word in the numerical characteristic matrix of the data to be analyzed, is less than a difference data threshold value; and when the data type of the data to be analyzed is image data, the texture information of the animal outline and the building outline of the data to be analyzed after image processing is the difference data, and the information of the other areas is the basic data;

combining basic data with the same basic indexes in each basic data to obtain a plurality of basic data sets;

calculating a priority coefficient of each subdata to be analyzed, and sequentially performing data analysis on each subdata to be analyzed according to the priority coefficient of each subdata to be analyzed;

wherein the step of combining basic data with the same basic indexes in each basic data to obtain a plurality of basic data sets comprises:

comparing the similar characteristics of the key data of each basic data by using a standard database to obtain the similar characteristic measurement of each basic data;

clustering each basic data according to the similarity characteristic measurement of each basic data to obtain a plurality of basic data sets;

the step of capturing the data field block corresponding to each basic data set comprises the following steps:

when the basic data set is obtained, parsing the data trunk of all basic data in the basic data set into a plurality of data characters X₁, X₂, X₃, and, in a database Yₙ containing the data information of each equipment terminal, grabbing each of the data characters X₁, X₂, X₃ to obtain all data field blocks containing each of the data characters X₁, X₂, X₃;

wherein, the step of calculating the priority coefficient of each subdata to be analyzed comprises:

and calculating the priority coefficient of each subdata to be analyzed according to the amount of running memory space of the processor occupied by each subdata to be analyzed.

2. The data analysis method of claim 1, wherein prior to the step of determining the corresponding data extraction manner based on the data type of the data to be analyzed, the method further comprises:

3. The data analysis method of claim 1, wherein the step of determining a corresponding data extraction manner based on the data type of the data to be analyzed and extracting each difference data and the basic data in the data to be analyzed using the data extraction manner comprises:

representing the preprocessed data to be analyzed as numerical vectors by utilizing a bag-of-words model so as to obtain a numerical characteristic matrix of the data to be analyzed;

and obtaining each distinguishing data and basic data in the data to be analyzed according to the numerical characteristic matrix.

4. The data analysis method of claim 1, wherein the step of determining a corresponding data extraction manner based on the data type of the data to be analyzed, and extracting each difference data and basic data in the data to be analyzed using the data extraction manner comprises:

and extracting pixel points and edges of the second processed data to obtain each difference data and basic data of the second processed data.

5. A data analysis system, comprising:

the data extraction module is used for receiving data to be analyzed sent by terminal equipment, determining a corresponding data extraction mode based on the data type of the data to be analyzed, and extracting each difference data and basic data in the data to be analyzed by using the data extraction mode, wherein, when the data type of the data to be analyzed is text data, the basic data is a word whose word frequency in the data to be analyzed, as given by the weighted value or frequency value of that word in a numerical characteristic matrix of the data to be analyzed, is greater than a basic data threshold value, and the difference data is a word whose word frequency in the data to be analyzed, as given by the weighted value or frequency value of that word in the numerical characteristic matrix of the data to be analyzed, is less than a difference data threshold value; and when the data type of the data to be analyzed is image data, the texture information of the animal outline and the building outline of the data to be analyzed after image processing is the difference data, and the information of the remaining areas is the basic data;

the data combination module is used for combining the basic data with the same basic indexes in each basic data to obtain a plurality of basic data sets;

the data calculation module is used for capturing data field blocks corresponding to the basic data sets, extracting characteristic indexes in the distinguishing data, and calculating the frequency of the characteristic indexes in the data field blocks by using a hash table algorithm;

the data analysis module is used for calculating the priority coefficient of each subdata to be analyzed and sequentially carrying out data analysis on each subdata to be analyzed according to the priority coefficient of each subdata to be analyzed;

wherein the data combination module comprises:

the data clustering unit is used for clustering the basic data according to the similarity characteristic measurement of the basic data to obtain a plurality of basic data sets;

wherein the data calculation module is specifically configured to:

wherein the data analysis module comprises:

6. The data analysis system of claim 5, wherein the system further comprises:

7. A readable storage medium on which a computer program is stored which, when being executed by a processor, carries out the data analysis method according to any one of claims 1 to 4.

8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the data analysis method according to any one of claims 1 to 4 when executing the computer program.