CN112765238A - Data processing system and data mining method

Info

Publication number: CN112765238A
Application number: CN202110101937.8A
Authority: CN
Inventors: 俞晓飞
Original assignee: Langda Network Technology Zhejiang Co ltd
Current assignee: Langda Network Technology Zhejiang Co ltd
Priority date: 2021-01-26
Filing date: 2021-01-26
Publication date: 2021-05-07
Also published as: WO2022160539A1

Abstract

The invention discloses a data processing system and a data mining method, belonging to the field of data processing, which address the problems that existing data processing and mining systems must mine complicated data, take a long time, and lack an association processing technique. A binary string is obtained, keywords are extracted from it, and the keywords are marked as defined problems; a data connection is established with a big data platform, a search is performed with each defined problem as a keyword, and each search result is marked as a database; when the same data occurs in different databases, the defined problems corresponding to those databases are marked as association group problems; data mining is then performed on the association group problems. Association is thus achieved during data processing, which reduces the time required for data mining; moreover, because the keywords used for data mining all come from the data processing system, the time spent standardizing keywords is greatly reduced, further shortening the mining time.

Description

Data processing system and data mining method

Technical Field

The invention belongs to the field of data processing, relates to a data processing and mining technology, and particularly relates to a data processing system and a data mining method.

Background

Data analysis refers to analyzing a large amount of collected data with appropriate statistical and analytical methods, and summarizing, understanding and digesting the data so as to make the fullest use of it. Data analysis is the process of studying and summarizing data in detail to extract useful information and form conclusions.

Data, also referred to as observation values, are the result of experiments, measurements, observations, investigations, and the like. The data processed in data analysis is divided into qualitative data and quantitative data. Data that can only be assigned to a category and cannot be measured numerically is called qualitative data. Qualitative data expressed as categories without any order is categorical (nominal) data, such as gender and brand; qualitative data expressed as categories with an order is ordinal data, such as educational level and commodity quality grade.

At present, with the popularization of big data technology, many industries have accumulated large amounts of text and structured data in the course of their operations, yet no technology is available for processing and mining long text data so as to accurately predict user behavior, identify and mine user requirements, improve user experience, increase customer value and shorten user handling time.

Disclosure of Invention

The invention aims to provide a data processing system and a data mining method that solve the problems that existing data processing and mining systems must mine complicated data, take a long time, and lack an association processing technique.

The purpose of the invention can be realized by the following technical scheme:

a data processing system comprises a data preprocessing module, a distribution module, a fusion module, an auxiliary processing module and a main processing module;

the data preprocessing module comprises data preprocessing and data matching processing; the distribution module is used for distributing, packaging and dispatching data; the fusion module is used for carrying out fusion processing on the data; the main processing module is used for processing the fused data;

the data preprocessing module comprises data preprocessing and data matching processing, and specifically comprises the following steps:

step one: acquiring data to be processed, and converting the data to be processed into a standard string through a standard conversion module;

step two: comparing the standard string with the preset strings stored in the module to obtain a comparison value;

step three: when the comparison value is greater than or equal to 95%, acquiring the preset processing flow corresponding to the preset string, and sending the preset processing flow to the auxiliary processing module for processing;

step four: when the comparison value is less than 95%, sending the standard string to the main processing module for processing;

converting the data to be processed into the standard string through the standard conversion module specifically comprises: acquiring the data to be processed and identifying it; when the data can be converted into text information, converting it into text; and converting the characters of the text, in sequence, into a binary string, the binary string being the standard string;

comparing the standard string with the preset strings stored in the module to obtain the comparison value specifically comprises: acquiring the keywords in the binary string, matching them against the keywords in the preset strings, selecting the preset string with the most successful keyword matches as the matching string, and comparing the matching string with the binary string to obtain the comparison value;

when the number of characters in the binary string differs from the number of characters in the matching string, the last matched keyword is selected as the end word;

when the end word appears in the binary string, the part of the binary string following the end word is matched against the preset strings again, and the successfully matched preset string is marked as the second tail string; this operation is repeated, marking the successfully matched preset strings as the third tail string, the fourth tail string, ..., and the Nth tail string, and matching finishes when no keyword in the part of the binary string following the end word matches any preset string;

at this point, the matching string is concatenated with the second tail string, the third tail string, the fourth tail string, ..., and the Nth tail string to obtain a combined matching string, and the characters of the combined matching string are matched against the characters of the binary string; the ratio of the number of characters in the combined matching string to the number of characters in the binary string is the comparison value.
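The four preprocessing steps and the comparison value can be illustrated with a minimal sketch. It is not the patented implementation and rests on several assumptions: plain text stands in for the binary string, keywords are whitespace tokens, tail-string chaining is omitted, and the comparison value is interpreted as the number of matched characters divided by the length of the input. The names `PRESETS`, `best_preset`, `comparison_value` and `route` are hypothetical.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.95  # the 95% cut-off stated in the text

# Hypothetical preset strings mapped to hypothetical preset processing flows.
PRESETS = {
    "reset account password": "flow_reset_password",
    "query order status": "flow_order_status",
}


def keywords(text):
    # Stand-in for keyword extraction: whitespace-separated tokens.
    return set(text.split())


def best_preset(standard):
    # Select the preset string that shares the most keywords with the input.
    return max(PRESETS, key=lambda p: len(keywords(p) & keywords(standard)))


def comparison_value(standard, preset):
    # Assumed reading: matched characters / characters in the input string.
    matcher = SequenceMatcher(None, standard, preset, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(standard) if standard else 0.0


def route(standard):
    # Steps three and four: choose the module according to the threshold.
    preset = best_preset(standard)
    if comparison_value(standard, preset) >= THRESHOLD:
        return ("auxiliary", PRESETS[preset])
    return ("main", standard)
```

Under these assumptions, an input identical to a preset is routed to the auxiliary processing module with its stored flow, while an unrelated input is routed onward unchanged.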

Further, the distribution module is configured to compare the keywords of standard strings whose comparison value is less than 95% and to distribute those strings according to the compared keywords, specifically:

extracting the keywords in the standard string, matching them against the stored keywords, and distributing standard strings with the same matching result to the corresponding data basket, thereby completing the distribution;

each data basket is composed of a plurality of storage units; different data baskets are connected to different main processing modules, and each data basket is connected to a single main processing module.
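The keyword-based distribution into data baskets can be sketched as follows. The keyword table `BASKET_OF_KEYWORD`, the basket names and the function `distribute` are hypothetical, and the rule that the first matching keyword decides the basket is an assumption; the text does not say how strings with several matching keywords, or none, are handled.

```python
from collections import defaultdict

# Hypothetical mapping from a stored keyword to the name of its data basket.
BASKET_OF_KEYWORD = {
    "invoice": "billing",
    "refund": "billing",
    "login": "account",
}


def distribute(standard_strings):
    """Group standard strings into baskets by their first known keyword."""
    baskets = defaultdict(list)
    for standard in standard_strings:
        for word in standard.split():
            basket = BASKET_OF_KEYWORD.get(word)
            if basket is not None:
                baskets[basket].append(standard)
                break  # assumption: the first matching keyword decides
        # Strings without any known keyword are skipped in this sketch.
    return dict(baskets)
```

Each basket could then be handed to its own main processing module, matching the one-basket-to-one-module connection described above.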

Furthermore, a fusion module is connected between the data basket and the main processing module;

the fusion module performs data fusion on the standard strings in the data basket; specifically, it acquires the standard strings in the data basket, extracts the keywords in each standard string, matches standard strings that share the same keywords, and splices and fuses the successfully matched standard strings.

Further, matching standard strings with the same keywords and fusing the successfully matched standard strings specifically comprises: obtaining the key values of the different standard strings, thereby obtaining the type and key value of each standard string; and fusing the standard strings based on their types and key values;

obtaining the key values of the different standard strings specifically comprises: classifying the standard strings and associating a key value with each standard string of a given type; a standard string may have one or more types;

fusing the standard strings specifically comprises: fusing standard strings of the same type and setting a key value for them; among standard strings of the same type, data with the higher key value is retained, data with the lower key value is discarded, and the higher key value is set for the retained data.
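The fusion rule, which keeps the data with the higher key value within each type, can be sketched as follows. The record layout `(type, key_value, data)` and the function name `fuse` are hypothetical; how key values are assigned is not specified in the text, so they are simply taken as given numbers, and each record is assumed to carry a single type.

```python
def fuse(records):
    """Fuse records of the same type, keeping the one with the highest key value.

    records: iterable of (type, key_value, data) tuples from one data basket.
    Returns a dict mapping each type to the retained (key_value, data) pair.
    """
    kept = {}
    for rec_type, key_value, data in records:
        current = kept.get(rec_type)
        # Retain the higher key value; on a tie the earlier record stays.
        if current is None or key_value > current[0]:
            kept[rec_type] = (key_value, data)
    return kept
```

For example, two "address" records with key values 1 and 3 collapse into the single record with key value 3.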

Further, the main processing module is configured to process the fused data; specifically, it obtains and decodes the fused standard string, processes the data content represented by the decoded standard string, and sends the processing result and the corresponding standard string to the auxiliary processing module.

Further, decoding the standard string specifically comprises obtaining the key value and the character numerical values of the standard string, and obtaining the output content by reverse deduction.

Further, the auxiliary processing module stores the processing result together with the corresponding standard string, extracts the keywords in the standard string, labels the keywords, marks the processing result as the preset processing flow of that standard string, and stores it;

when the keywords in the standard string are extracted and labeled, the keywords include newly added keywords.

Further, the auxiliary processing module is used for outputting a preset processing flow.
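The way the auxiliary processing module accumulates preset processing flows can be sketched as a small store. The class `AuxiliaryModule` and its methods are hypothetical, keyword extraction is reduced to whitespace tokenisation, and exact-match lookup stands in for the 95% comparison described earlier.

```python
class AuxiliaryModule:
    """Stores processed results as preset flows keyed by standard string."""

    def __init__(self):
        self.presets = {}      # standard string -> preset processing flow
        self.keywords = set()  # all labeled keywords, including new ones

    def record(self, standard, result):
        # Label the keywords of the string and store the result as its flow.
        self.keywords.update(standard.split())
        self.presets[standard] = result

    def lookup(self, standard):
        # Output the preset processing flow, or None if none is stored yet.
        return self.presets.get(standard)
```

A string processed once by the main processing module can be recorded here, so that the same string is later answered from the stored flow.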

A method of data mining, the method comprising the steps of:

A1, acquiring a binary string, extracting keywords from the binary string, and marking the keywords as defined problems;

A2, establishing a data connection with a big data platform, searching with each defined problem as a keyword, and marking each search result as a database;

A3, when the same data occurs in different databases, marking the defined problems corresponding to those databases as association group problems;

A4, performing data mining on the association group problems.

Further, when 35% of the data in two different databases is identical, the two databases are considered to contain the same data.
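The grouping in steps A3 and A4 can be sketched as a pairwise overlap test. The text does not say what the 35% is measured against or whether it is a minimum; this sketch assumes the share is taken relative to the smaller database and treats 35% as a lower bound. The names `same_data` and `association_groups` are hypothetical, and each search result is modelled as a set of record identifiers.

```python
from itertools import combinations

OVERLAP = 0.35  # the 35% figure stated in the text


def same_data(db_a, db_b, threshold=OVERLAP):
    # Assumption: overlap is measured against the smaller database.
    if not db_a or not db_b:
        return False
    shared = len(db_a & db_b)
    return shared / min(len(db_a), len(db_b)) >= threshold


def association_groups(search_results):
    """search_results: dict mapping a defined problem to its set of records.

    Returns the pairs of defined problems whose databases share enough data.
    """
    return [
        (p, q)
        for p, q in combinations(search_results, 2)
        if same_data(search_results[p], search_results[q])
    ]
```

The returned pairs are the candidates on which the actual data mining would then be run.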

Compared with the prior art, the invention has the beneficial effects that:

(1) the data preprocessing module obtains the data to be processed, converts it into a standard string through the standard conversion module, and compares the standard string with the preset strings stored in the module to obtain a comparison value, so that during preprocessing a preset flow with a high matching degree is intelligently identified to process the data, which reduces the flow for handling complex problems; meanwhile, when an end word appears in the binary string, the part of the binary string following the end word is matched against the preset strings again and the successfully matched preset string is marked as the second tail string; the operation is repeated, marking successfully matched preset strings as the third tail string, the fourth tail string, ..., and the Nth tail string, until no keyword in the part of the binary string following the end word matches any preset string, so that the problems to be processed can be separated and spliced, combined problems can be screened, and preprocessing becomes more intelligent;

(2) the auxiliary processing module stores the processing result together with the corresponding standard string, extracts the keywords in the standard string, labels them, marks the processing result as the preset processing flow of that standard string, and stores it, so that a processing flow that was not previously preset is recorded as a preset processing flow; the system therefore has a certain capacity for intelligent learning and gradually becomes fully intelligent through continuous accumulation;

(3) a binary string is obtained, keywords are extracted from it, and the keywords are marked as defined problems; a data connection is established with a big data platform, a search is performed with each defined problem as a keyword, and each search result is marked as a database; when the same data occurs in different databases, the defined problems corresponding to those databases are marked as association group problems; data mining is performed on the association group problems, so that association is achieved during data processing and the time required for data mining is reduced; meanwhile, the keywords used for data mining all come from the data processing system, which greatly reduces the time spent standardizing keywords and shortens the mining time;

(4) meanwhile, through big data retrieval, the data analysis and mining method abstractly organizes the various data and determines the scope of each database, the organizational form of the data and so on, until these are converted into actual databases; by establishing a plurality of databases and mapping the databases to keywords, it analyzes and establishes the database entities and the relationships among them; it then collects, sorts and cleans the data from different data sources through data integration, and loads and stores the data after conversion, which makes it convenient to explore the analysis results and removes the need to build the databases again at the time of use, saving mining time.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic block diagram of the present invention;

FIG. 2 is a block diagram of the method of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings of the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.

Thus, the detailed description of the embodiments of the present invention provided in the following drawings is not intended to limit the scope of the invention as claimed, but is merely representative of selected embodiments of the invention.

As shown in fig. 1, a data processing system includes a data preprocessing module, a distribution module, a fusion module, an auxiliary processing module, and a main processing module;

the data preprocessing module comprises data preprocessing and data matching processing; the distribution module is used for distributing, packaging and dispatching the data; the fusion module is used for carrying out fusion processing on the data; the main processing module is used for processing the fused data; the auxiliary processing module is used for outputting a preset processing flow.

In an embodiment of the present invention, the modules all use a processor as a carrier, where the processor is an integrated circuit chip with signal processing capability. In implementation, the steps of data processing and data mining may be performed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. Such a processor can implement or perform the methods and steps of the embodiments of the invention. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method according to the embodiments of the present invention may be implemented directly by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers.

Specifically, when the present invention is implemented, the data preprocessing module performs data preprocessing and data matching processing: it acquires the data to be processed and converts it into a standard string through the standard conversion module; compares the standard string with the preset strings stored in the module to obtain a comparison value; and, when the comparison value is greater than or equal to 95%, acquires the preset processing flow corresponding to the preset string and sends it to the auxiliary processing module for processing;

the auxiliary processing module stores the processing result together with the corresponding standard string, extracts the keywords in the standard string, labels the keywords, marks the processing result as the preset processing flow of that standard string, and stores it;

specifically, when the keywords in the standard string are extracted and labeled, the keywords include newly added keywords;

it should be noted that the selection of keywords needs to meet the following requirements: the data is analyzed to find the words that express its central content and subject concepts; with reference to the vocabulary lists in the relevant standards, standard terms are selected as keywords wherever possible; words with vague meaning, no specific meaning or no retrieval value are removed, and synonyms and near-synonyms are deleted; articles, pronouns, prepositions, conjunctions, interjections and certain verbs (linking verbs, modal verbs) are not used; generic words that express no specific concept, such as theory, report, experiment, learning, method, problem, countermeasure, approach, characteristic, purpose, concept and development, as well as evaluative words and unfamiliar common words, are not used; mathematical formulas and chemical formulas may be selected as keywords; professional codes, and the names of materials, equipment, methods, persons and places, may be used as keywords;

when the comparison value is less than 95%, the standard string is sent to the main processing module for processing;
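Part of the keyword selection rules can be illustrated with a simple filter. This sketch covers only the exclusion of function words, generic concept words and repeated words; the word lists `FUNCTION_WORDS` and `GENERIC_WORDS` are small hypothetical samples in English, and lookup against a standard vocabulary and synonym handling are left out.

```python
# Hypothetical sample lists; a real system would use a fuller vocabulary.
FUNCTION_WORDS = {"the", "a", "an", "of", "and", "or", "in", "on", "it", "is"}
GENERIC_WORDS = {
    "theory", "report", "experiment", "method", "problem",
    "approach", "purpose", "concept", "development",
}


def select_keywords(tokens):
    """Drop function words, generic words and duplicates, keeping order."""
    seen = set()
    selected = []
    for token in tokens:
        word = token.lower()
        if word in FUNCTION_WORDS or word in GENERIC_WORDS or word in seen:
            continue
        seen.add(word)
        selected.append(word)
    return selected
```

Only content-bearing words survive the filter, each listed once in order of first appearance.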

converting the data to be processed into the standard string through the standard conversion module means acquiring the data to be processed and identifying it; when the data can be converted into text information, converting it into text; and converting the characters of the text, in sequence, into a binary string, the binary string being the standard string;

in a specific implementation, converting the characters into the binary string in sequence comprises obtaining the ordinal position of each character in the table of commonly used modern Chinese characters and converting that position into binary; for example, the binary code of the Chinese character "一" (one) is 0001;
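The character-to-binary conversion, and the reverse deduction used for decoding, can be sketched as a table lookup. The four-entry `CHAR_TABLE` is a hypothetical excerpt used only for illustration and does not reproduce the real character table; the fixed width of 4 bits matches the "0001" example but is an assumption, since a table with thousands of entries would need a wider code.

```python
# Hypothetical excerpt standing in for the common-character table.
CHAR_TABLE = ["一", "乙", "二", "十"]
INDEX = {ch: i + 1 for i, ch in enumerate(CHAR_TABLE)}  # 1-based positions


def encode(text, width=4):
    """Convert each character to the binary form of its table position."""
    return [format(INDEX[ch], f"0{width}b") for ch in text]


def decode(codes):
    """Reverse deduction: map each binary code back to its character."""
    return "".join(CHAR_TABLE[int(code, 2) - 1] for code in codes)
```

With this excerpt, `encode("一")` yields `["0001"]`, in line with the example in the text, and `decode` inverts `encode` for any string made of characters in the table.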

The distribution module compares the keywords of standard strings whose comparison value is less than 95% and distributes them according to the compared keywords; specifically, it extracts the keywords in the standard string, matches them against the stored keywords, and distributes standard strings with the same matching result to the corresponding data basket, thereby completing the distribution.

Furthermore, a fusion module is connected between the data basket and the main processing module; the fusion module performs data fusion on the standard strings in the data basket; specifically, it acquires the standard strings in the data basket, extracts the keywords in each standard string, matches standard strings that share the same keywords, and splices and fuses the successfully matched standard strings;

furthermore, the key values of the different standard strings are obtained, thereby obtaining the type and key value of each standard string; the standard strings are fused based on their types and key values;

obtaining the key values of the different standard strings specifically comprises: classifying the standard strings and associating a key value with each standard string of a given type; a standard string may have one or more types;

in the practice of the present invention, the same type means that the subject matter is consistent in nature and character, for example both items being payments of money, or both being deliveries of goods of the same kind. The same quality means that the subject matter does not differ in quality, specification or grade, for example both being grade-one Tianjin rice. Debts that differ in type or quality may not, in principle, be offset against each other: debts of different types serve different economic purposes, so offsetting them would likely defeat those purposes, and debts of different quality have different economic values and are difficult to offset fairly.

Fusing the standard strings specifically comprises: fusing standard strings of the same type and setting a key value for them; among standard strings of the same type, data with the higher key value is retained, data with the lower key value is discarded, and the higher key value is set for the retained data.

Further, the main processing module is configured to process the fused data; specifically, it obtains the fused standard string, decodes it, processes the data content represented by the decoded standard string, and sends the processing result and the corresponding standard string to the auxiliary processing module.

In the specific implementation of the invention, the key value and the character numerical values of the standard string are obtained, and the output content is obtained by reverse deduction;

when the present invention is implemented, the data connection between the modules may use a wired communication component or a wireless communication component; the wired communication component may be a transmission line or a USB interface; the wireless communication component may include a Bluetooth module, a Wi-Fi module, a 3G/4G/5G module, etc.

As shown in fig. 2, the present invention further relates to a data mining method, which specifically comprises: acquiring a binary string, extracting keywords from the binary string, and marking the keywords as defined problems; establishing a data connection with a big data platform, searching with each defined problem as a keyword, and marking each search result as a database; when the same data occurs in different databases, marking the defined problems corresponding to those databases as association group problems; and performing data mining on the association group problems.

When 35% of the data in two different databases is the same, the two databases are determined to contain the same data.

The method uses a memory, i.e., a machine-readable storage medium, to store one or more computer instructions, which are executed by a processor to implement the steps of the data mining method; the memory may include high-speed Random Access Memory (RAM) and may also include non-volatile memory, such as at least one disk memory. The communication connection between a network element of the system and at least one other network element is realized through at least one communication interface (which may be wired or wireless), and may use the Internet, a wide area network, a local area network, a metropolitan area network, and the like. The bus may be an ISA bus, a PCI bus, an EISA bus, or the like, and may be divided into an address bus, a data bus, a control bus, and so on.

The above formulas are all calculated on dimensionless numerical values; each formula is the one closest to the real situation, obtained by collecting a large amount of data and performing software simulation, and the preset parameters in the formulas are set by those skilled in the art according to the actual situation.

In the embodiments provided by the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and there may be other divisions when the actual implementation is performed; the modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the method of the embodiment.

It will also be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.

The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.

Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names and do not indicate any particular order.

Finally, it should be noted that the above examples are only intended to illustrate the technical process of the present invention and not to limit the same, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made to the technical process of the present invention without departing from the spirit and scope of the technical process of the present invention.

Claims

1. A data processing system is characterized by comprising a data preprocessing module, a distribution module, a fusion module, an auxiliary processing module and a main processing module;

2. The data processing system of claim 1, wherein the distribution module is configured to compare the keywords of standard strings whose comparison value is less than 95%, and to distribute those strings according to the compared keywords, specifically:

3. The data processing system of claim 2, wherein a fusion module is further connected between the data basket and the main processing module;

4. The data processing system of claim 3, wherein the matching of the standard string tokens with the same keyword and the data fusion of the successfully matched standard string tokens are to obtain key values of different standard string tokens, thereby obtaining the type of the standard string token and the key value thereof; fusing each standard character string character based on the types and key values of different standard character string characters;

5. The data processing system according to claim 4, wherein the main processing module is configured to process the fused data, specifically, obtain and decode the fused standard string identifier, process the data content represented by the decoded standard string identifier, and send the processed result and the corresponding standard string identifier to the auxiliary processing module.

6. The data processing system of claim 5, wherein the decoding of the standard string indicator is further configured to obtain a key value and a character value of the standard string indicator, and obtain the output content according to a reverse-derivation relationship.

7. The data processing system of claim 6, wherein the auxiliary processing module stores the processed result and the corresponding standard string symbol, extracts the standard string symbol and the key words in the standard string symbol, labels the key words, and labels the processed result as a preset processing flow of the standard string symbol and stores the result;

8. The data processing system and the data mining method as claimed in claim 7, wherein the auxiliary processing module is configured to output a predetermined processing flow.

9. A data mining method, characterized in that the data mining method comprises the steps of:

a4, performing data mining on the association group questions.

10. A method according to claim 9, wherein, when 35% of the data in two different databases is identical, the two databases are considered to contain the same data.