CN117971819B

CN117971819B - Management method and system for automatically collecting stream data

Info

Publication number: CN117971819B
Application number: CN202410371733.XA
Authority: CN
Inventors: 武春庆
Original assignee: Nanjing Jinding Jiaqi Information Technology Co ltd
Current assignee: Nanjing Jinding Jiaqi Information Technology Co ltd
Priority date: 2024-03-29
Filing date: 2024-03-29
Publication date: 2024-05-31
Anticipated expiration: 2044-03-29
Also published as: CN117971819A

Abstract

A management method for automatically collecting flow data, comprising: obtaining multiple pieces of flow data; identifying the type name and the type information of each piece of flow data, and generating associated mapping values based on the type information; determining, based on the mapping values of the flow data, whether its associated flow data set is entirely in the cache; if not, storing the flow data in the cache; if so, judging the accuracy of the recognition based on the flow data and its associated flow data set, and storing them together in the relevance database. The invention adopts a new method for automatically collecting flow data, which differs from conventional methods that rely solely on a regular-expression system or solely on natural language processing to recognize flow data. The method has extremely high recognition accuracy.

Description

Management method and system for automatically collecting stream data

Technical Field

The invention belongs to the technical field of data analysis, and particularly relates to a management method and system for automatically collecting flow data.

Background

In modern economic activity, flow data such as bank transaction records, electronic payment information, and call detail records play a vital role. These data reflect the financial status, consumption habits, and communication patterns of individuals and businesses. However, because such information typically appears as unstructured text, integrating, analyzing, and managing it is complex and time-consuming. Businesses and individuals nevertheless need to collect and integrate flow data from a variety of sources to support critical activities such as decision making, financial management, and customer service.

Currently, techniques for automatically collecting flow data rely mainly on regular expressions, natural language processing (NLP), data mining, and machine learning. By applying entity recognition and pattern matching algorithms, such systems can recognize key information in text, such as dates, amounts, and participants. In addition, text classification and sentiment analysis are used to analyze the data further and provide insight to the user. Together, these techniques yield a system that automatically processes and collects flow data, giving the user a clearer, more ordered financial picture.

Despite significant advances, current techniques for automatically collecting flow data still have shortcomings. First, regular-expression systems still struggle with complex and non-standardized text. Second, natural language processing does not understand implicit meaning and contextual information in text deeply enough, which limits both the accuracy of information extraction and the range of application of the system.

Disclosure of Invention

To remedy the defects of the prior art, the invention provides a management method and system for automatically collecting flow data.

The invention adopts the following technical scheme.

A first aspect of the invention discloses a management method for automatically collecting flow data, comprising steps 1 to 3:

Step 1, obtaining multiple pieces of flow data;

Step 2, identifying the type name and the type information of each piece of flow data, and generating associated mapping values based on the type information;

Step 3, determining, based on the mapping values of the flow data, whether its associated flow data set is entirely in the cache; if not, storing the flow data in the cache; if so, judging the accuracy of the recognition based on the flow data and its associated flow data set, and storing them together in the relevance database.

Further, the method comprises preprocessing the multiple pieces of flow data before they are acquired, the preprocessing comprising data cleaning and normalization.

Further, generating the associated mapping values based on the type information specifically comprises: generating the associated mapping values based on the transaction counterpart information or the transaction platform information in the type information.

Further, in step 2, recognition and segmentation of the flow data is implemented by means of regular expressions.

Further, generating the associated mapping values based on the type information specifically comprises: generating a first mapping value based on the transaction time in the type information, and generating a plurality of second mapping values based on a plurality of scarce keywords in the transaction counterpart information or the transaction platform information in the type information; wherein a scarce keyword is a single Chinese character or a single word.

Further, in step 3, determining, based on the mapping values of the flow data, whether its associated flow data set is entirely in the cache comprises steps 3.1 to 3.3:

Step 3.1, obtaining a first subsequent mapping value based on the first mapping value;

Step 3.2, based on the first mapping value of the flow data, obtaining all flow data under the first mapping value and the first subsequent mapping value as the flow data set to be compared;

Step 3.3, comparing the second mapping values of the flow data in turn with the second mapping values of each piece of flow data in the flow data set to be compared, and determining that a piece of flow data is associated flow data if the similarity between the two sets of second mapping values is greater than or equal to a preset similarity threshold.

Further, the process of selecting the scarce keywords specifically comprises steps S101 to S103:

Step S101, obtaining each keyword in the transaction counterpart information or the transaction platform information in the type information of the flow data;

Step S102, calculating the mapping value corresponding to each keyword according to Huffman coding;

Step S103, sorting the mapping values from high to low, and selecting the n largest mapping values as the plurality of second mapping values, where n is the number of second mapping values.

A second aspect of the invention discloses a management system for automatically collecting flow data, which applies the method of the first aspect and comprises: a data acquisition module, a logic judgment module, a first data storage module, and a second data storage module;

the data acquisition module is used for obtaining multiple pieces of flow data;

the logic judgment module is used for identifying the type name and the type information of each piece of flow data and generating associated mapping values based on the type information; for determining, based on the mapping values of the flow data, whether its associated flow data set is entirely in the cache; and for judging the accuracy of the recognition based on the flow data and its associated flow data set;

the first data storage module is used for storing the flow data in the cache;

the second data storage module is used for storing the flow data together with its associated flow data set in the relevance database.

A third aspect of the invention discloses a terminal comprising a processor and a storage medium, characterized in that:

the storage medium is used for storing instructions;

the processor is configured to operate in accordance with the instructions to perform the steps of the method of the first aspect.

A fourth aspect of the invention discloses a computer-readable storage medium on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the steps of the method according to the first aspect.

Compared with the prior art, the invention has the following advantages:

The invention adopts a new method for automatically collecting flow data, which differs from conventional methods that rely solely on a regular-expression system or solely on natural language processing to recognize flow data. The method has extremely high recognition accuracy.

Drawings

FIG. 1 is a flow chart of a management method for automatically collecting flow data according to an embodiment of the present invention.

Detailed Description

The application is further described below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present application, and are not intended to limit the scope of the present application.

The invention discloses a management method for automatically collecting flow data which, as shown in FIG. 1, may comprise steps 1 to 3.

Step 1: obtain multiple pieces of flow data.

The flow data may be transaction data from bank accounts, electronic payment platforms (such as Alipay and WeChat), or other sources such as communication service providers. The flow data is typically text data in a variety of formats and structures.

In some embodiments, the method further comprises preprocessing the multiple pieces of flow data before they are acquired, the preprocessing comprising data cleaning and normalization. Data cleaning removes irrelevant information and so reduces the volume of flow data. Normalization brings all flow data into a uniform format while identifying the key information in it, such as transaction date, amount, transaction parties, and transaction description, to facilitate later storage.
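As an illustration of the preprocessing described above, the following minimal Python sketch cleans raw lines and normalizes one parsed record. The noise pattern, the field names (`date`, `amount`, `counterparty`, `description`, `date_format`), and the helper names `clean` and `normalize` are assumptions made for this example; the method itself does not specify them.

```python
import re
from datetime import datetime

# Hypothetical noise: blank lines and "Page x of y" footers.
NOISE = re.compile(r"^\s*(Page \d+ of \d+)?\s*$")

def clean(lines):
    """Data cleaning: drop lines that carry no transaction information."""
    return [ln.strip() for ln in lines if not NOISE.match(ln)]

def normalize(record):
    """Normalization: bring one parsed record into a uniform layout."""
    fmt = record.get("date_format", "%Y-%m-%d")  # assumed per-source date format
    date = datetime.strptime(record["date"], fmt)
    amount = float(record["amount"].replace(",", ""))
    return {
        "date": date.strftime("%Y-%m-%d %H:%M:%S"),
        "amount": round(amount, 2),
        "counterparty": " ".join(record["counterparty"].split()),
        "description": record.get("description", "").strip(),
    }
```

Records from different sources can then be stored and compared in one layout, regardless of how each source formats dates and amounts.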

Step 2: identify the type name and the type information of each piece of flow data, and generate associated mapping values based on the type information.

The type information (that is, the entity information) may include transaction time, transaction amount, transaction type, transaction party information, transaction counterpart information, transaction platform information, order number, and the like.
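To make the recognition in step 2 concrete, the sketch below extracts a few of the fields listed above with a regular expression. The line layout (time, counterpart, amount, order number separated by whitespace), the pattern `RECORD`, and the function `extract` are hypothetical; real flow data would need one pattern per source format.

```python
import re

# Assumed layout: "<YYYY-MM-DD HH:MM:SS> <counterpart> <amount> <order number>"
RECORD = re.compile(
    r"(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<counterpart>.+?)\s+"
    r"(?P<amount>-?\d+(?:\.\d{2})?)\s+"
    r"(?P<order_no>[A-Z0-9]{8,})"
)

def extract(text):
    """Return one dict of type information per record found in the text."""
    return [m.groupdict() for m in RECORD.finditer(text)]
```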

In the first embodiment of the invention, the associated mapping values may be generated based on the transaction counterpart information or the transaction platform information in the type information. Suppose a piece of flow data X contains transaction party information "x0" and transaction counterpart information "x1"; its associated mapping value is then the single mapping value corresponding to "x1". Suppose a piece of flow data Y contains transaction party information "y0", transaction counterpart information "y1", and transaction platform information "y2"; its associated mapping values are then the two mapping values corresponding to "y1" and "y2", respectively.

In the first embodiment, the mapping values may be generated by a hash function; the specific formula is not described in detail here.
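The X and Y examples can be sketched as follows. The text does not name a hash function, so SHA-256 truncated to 16 hexadecimal characters is an assumption, as are the dictionary keys `party`, `counterpart`, and `platform` and both function names.

```python
import hashlib

def mapping_value(text):
    """Assumed hash: first 16 hex characters of SHA-256 over the UTF-8 text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]

def associated_mapping_values(type_info):
    """One mapping value per counterpart/platform field present in the record."""
    keys = ("counterpart", "platform")
    return [mapping_value(type_info[k]) for k in keys if type_info.get(k)]

x = {"party": "x0", "counterpart": "x1"}
y = {"party": "y0", "counterpart": "y1", "platform": "y2"}
```

Here `associated_mapping_values(x)` yields one value (for "x1") and `associated_mapping_values(y)` yields two (for "y1" and "y2"), matching the counts given in the text.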

In some embodiments, a Redis database may be used as the cache.

It will be appreciated that if the flow data is flow data X, its associated flow data set contains at least 1 piece of flow data; if the flow data is flow data Y, its associated flow data set contains at least 2 pieces. In step 3, storing together in the relevance database means that not only is the flow data itself stored in the relevance database, but its associated flow data set is also transferred from the cache to the relevance database.

In step 3, taking flow data Y as an example, the at least 2 pieces of flow data S and T associated through the corresponding 2 mapping values should record the same transaction as flow data Y, differing only in the information about the two transaction parties and the transaction platform.
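The cache-or-store decision of step 3 might look like the sketch below. A plain dictionary stands in for the cache (Redis in some embodiments) and a list for the relevance database. The cross-validation rule (agreement on time and amount) and the `"needs_review"` outcome are assumptions, since the text does not say how accuracy is judged or what happens when the check fails.

```python
cache = {}          # stand-in for the cache: mapping value -> list of records
relevance_db = []   # stand-in for the relevance database

def cross_validate(record, associated):
    # Hypothetical rule: associated records must agree on time and amount.
    return all(r["time"] == record["time"] and r["amount"] == record["amount"]
               for r in associated)

def handle(record, mapping_values):
    groups = [cache.get(v, []) for v in mapping_values]
    if not all(groups):
        # Associated set is not entirely in the cache: keep the record for later.
        for v in mapping_values:
            cache.setdefault(v, []).append(record)
        return "cached"
    associated = [r for group in groups for r in group]
    if not cross_validate(record, associated):
        return "needs_review"   # assumed outcome; not specified in the text
    # Store the record together with its associated set, then clear the cache.
    relevance_db.append({"record": record, "associated": associated})
    for v in mapping_values:
        cache.pop(v, None)
    return "stored"
```

The first record seen for a mapping value is always cached; a later record with the same mapping value triggers the cross-validation and the joint transfer out of the cache.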

In modern data processing practice, automatically recognizing and collecting flow data is an important and challenging task. Mechanically recognizing the entity information of each piece of flow data by a fixed standard, for example quickly recognizing it with preset regular expressions, is a common approach, but it has inherent limitations. In particular, the order and kinds of entities contained in different pieces of flow data may differ, which affects the consistency and accuracy of processing. In addition, problems in the data acquisition process, such as page-number conversion errors, may cause flow records to be merged or split incorrectly, or mixed with irrelevant information, further reducing the accuracy of automated processing. Even if the success rate of automatic collection reaches 99%, the subsequent work of finding and fixing the remaining omissions can still be costly, offsetting the efficiency gained through automation.

In other embodiments, techniques based on natural language processing (NLP) are introduced to raise the recognition rate of entities in text. However, their recognition accuracy is not even as high as that of regular expressions, and natural language processing cannot solve the data definition problem. That is, in the prior art it is difficult to determine precisely which content belongs to a particular flow record and which content belongs to no flow data at all.

In the embodiment of the invention, step 3 judges the accuracy of recognition by cross-validation (that is, based on the flow data and its associated flow data set), which readily addresses both the recognition accuracy problem and the data definition problem. In step 2, therefore, fast recognition and segmentation of the flow data can be implemented entirely with regular expressions. The above steps still have defects, however. First, the purpose of the mapping values is to classify and recognize the flow data accurately, yet the mapping values are themselves determined by type information in the flow data (such as transaction counterpart information or transaction platform information). Once the regular-expression recognition makes a mistake, the mapping values are also wrong, which is essentially a chicken-and-egg problem. Second, flow data is usually processed in batches (and the transaction information within a batch is usually the same), so the first batch of flow data must be stored in the cache in its entirety; the cache is usually memory, which may not have room for huge amounts of data.

Based on this, in the second embodiment of the invention, generating the associated mapping values based on the type information specifically comprises: generating a first mapping value based on the transaction time in the type information, and generating a plurality of second mapping values based on a plurality of scarce keywords in the transaction counterpart information or the transaction platform information in the type information; wherein a scarce keyword (and likewise a keyword hereinafter) is a single Chinese character or a single word.

The advantage of using the transaction time to generate an associated mapping value is that the format of the transaction time is itself well defined, flow data is usually ordered by transaction time, and two pieces of flow data can be cross-validated to obtain an accurate transaction time.

More specifically, a piece of flow data may differ in time from its associated flow data by a time difference Δt, so the first mapping value should cover the entire corresponding mapping interval. Thus, in the second embodiment, the first mapping value M1 may be computed from the timestamp of the transaction time by a remainder operation, where mod denotes the remainder operation, t is the timestamp of the transaction time, and T is an integer greater than Δt.
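The symbols of the equation did not survive in this text, so the following is only one formula consistent with the stated definitions (a remainder operation, a timestamp of the transaction time, and an integer larger than the time difference). The symbol names $M_1$, $t$, $T$, and $\Delta t$ are assumptions:

```latex
M_1 = t - (t \bmod T), \qquad T > \Delta t
```

Under this reading, every timestamp in the interval $[M_1, M_1 + T)$ shares the same first mapping value, and because $T > \Delta t$, an associated record falls either in the same interval or in an adjacent one.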

It can be appreciated that the second embodiment essentially uses the idea of bucket sort: a first mapping value (essentially a mapping interval, each interval being regarded as a bucket) is generated from the transaction time to determine roughly which bucket the flow data falls in; the scarce keywords are then used to determine the specific position of the flow data, for example by means of a hash table.

Therefore, in step 3, determining, based on the mapping values of the flow data, whether its associated flow data set is entirely in the cache specifically comprises steps 3.1 to 3.3.

Step 3.1: obtain a first subsequent mapping value based on the first mapping value.

The first subsequent mapping value is the mapping value that immediately follows the first mapping value, that is, the mapping value of the next mapping interval.

Step 3.2: based on the first mapping value of the flow data, obtain all flow data under the first mapping value and the first subsequent mapping value as the flow data set to be compared.
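Steps 3.1 and 3.2 can be sketched as a bucket lookup. The formula used here, `t - t % T` for the first mapping value and `+ T` for the subsequent one, is an assumed reading of the text (a remainder operation on the timestamp with an integer bucket width). The one-hour width, the UTC interpretation of naive times, and the names `first_mapping_value` and `candidates` are likewise assumptions.

```python
from datetime import datetime, timezone

T = 3600  # assumed bucket width in seconds; must exceed the expected time difference

def first_mapping_value(transaction_time):
    """Start of the bucket containing the transaction time (assumed formula)."""
    t = int(transaction_time.replace(tzinfo=timezone.utc).timestamp())
    return t - t % T

def candidates(buckets, transaction_time):
    """Step 3.2: all records in the record's own bucket and in the next bucket."""
    m1 = first_mapping_value(transaction_time)
    next_m1 = m1 + T  # step 3.1: first subsequent mapping value
    return buckets.get(m1, []) + buckets.get(next_m1, [])
```

Only two buckets are scanned per record, so the comparison in step 3.3 runs against a small candidate set rather than the whole cache.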

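Step 3.3: compare the second mapping values of the flow data in turn with the second mapping values of each piece of flow data in the flow data set to be compared; if the similarity between the two is greater than or equal to a preset similarity threshold, the piece of flow data is determined to be associated flow data.

It will be appreciated that, in the cache, the plurality of second mapping values of each piece of flow data should also be sorted from high to low, forming a vector. Typically, the number n of second mapping values should be at least 5, and the preset similarity threshold may be set to 80%. The similarity between two pieces of flow data is the ratio of the number of second mapping values they have in common to the total number of second mapping values.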

It should be noted that, in step 3.3, when the similarity between two pieces of flow data is greater than or equal to the preset similarity threshold, it is usually necessary to compare the information of the two pieces of flow data further in order to confirm that they are associated. This further determination amounts to enlarging the number of second mapping values being compared, which guards against an inaccurate determination caused by an n that was too small; correspondingly, when the number of second mapping values is enlarged, the similarity threshold should also be raised. The specific process is not described in detail.
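The similarity test of step 3.3 can be illustrated as follows, taking the similarity as the share of second mapping values two records have in common. Using the larger of the two list lengths as the total is an assumption for the case where the lists differ in length; the function names are also hypothetical.

```python
def similarity(a, b):
    """Share of second mapping values that the two records have in common."""
    n = max(len(a), len(b))
    return len(set(a) & set(b)) / n if n else 0.0

def is_associated(a, b, threshold=0.8):
    """Threshold of 80% as suggested in the text."""
    return similarity(a, b) >= threshold
```

With n = 5, two records must share at least four second mapping values to reach the 80% threshold.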

To prevent common keywords in the flow data from being selected as scarce keywords, in the embodiment of the invention the process of selecting the scarce keywords may specifically comprise steps S101 to S103.

Step S101: obtain each keyword in the transaction counterpart information or the transaction platform information in the type information of the flow data.

Step S102: calculate the mapping value corresponding to each keyword according to Huffman coding.

Step S103: sort the mapping values from high to low, and select the n largest mapping values as the plurality of second mapping values, where n is the number of second mapping values.

It can be understood that the keywords corresponding to the n largest second mapping values are the scarce keywords.

It should be noted that Huffman coding is used in step S102 not to compress data, but because its core idea is to assign bit sequences (codes) of different lengths according to the frequency or probability with which each keyword occurs. In Huffman coding, the keyword that occurs most frequently is assigned the shortest code, and keywords that occur less frequently are assigned longer codes; the mapping value of each keyword in step S102 is therefore the Huffman code assigned to that keyword.
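A minimal sketch of the Huffman-based selection follows. It assumes a keyword frequency table is available that covers every keyword, and it ranks keywords by code length (longer code, rarer keyword), which is one possible reading of "sorting the mapping values from high to low". The romanized keywords and the frequencies are invented for illustration.

```python
import heapq
from itertools import count

def huffman_codes(freq):
    """Build Huffman codes for a {keyword: frequency} mapping."""
    if not freq:
        return {}
    if len(freq) == 1:
        return {k: "0" for k in freq}
    tie = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(f, next(tie), {k: ""}) for k, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {k: "0" + c for k, c in left.items()}
        merged.update({k: "1" + c for k, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

def scarce_keywords(keywords, freq, n):
    """Pick the n keywords with the longest codes, i.e. the rarest ones."""
    codes = huffman_codes(freq)
    unique = list(dict.fromkeys(keywords))  # de-duplicate, keep input order
    return sorted(unique, key=lambda k: len(codes[k]), reverse=True)[:n]

# Hypothetical frequencies: generic words are common, name parts are rare.
freq = {"company": 50, "limited": 40, "nanjing": 8, "jinding": 1, "jiaqi": 1}
```

With these frequencies, generic words such as "company" receive the shortest codes and are ranked last, which is the effect the selection step aims for.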

Correspondingly, the invention also discloses a management system for automatically collecting flow data, comprising: a data acquisition module, a logic judgment module, a first data storage module, and a second data storage module.

Those skilled in the art will appreciate that all or part of the methods described above may be implemented by a computer program stored on a non-transitory computer-readable storage medium; when executed, the program may carry out the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory.

By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope of this description.

The foregoing examples illustrate only a few embodiments of the invention and are described in detail herein without thereby limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.

Claims

1. A management method for automatically collecting flow data, characterized by comprising steps 1 to 3:

Step 1, obtaining multiple pieces of flow data;

Step 2, identifying the type name and the type information of each piece of flow data, and generating associated mapping values based on the type information; wherein generating the associated mapping values based on the type information specifically comprises: generating a first mapping value based on the transaction time in the type information, and generating a plurality of second mapping values based on a plurality of scarce keywords in the transaction counterpart information or the transaction platform information in the type information; wherein a scarce keyword is a single Chinese character or a single word;

the process of selecting the scarce keywords specifically comprises steps S101 to S103;

Step S103, sorting the mapping values from high to low, and selecting the n largest mapping values as the plurality of second mapping values, where n is the number of second mapping values;

Step 3, determining, based on the mapping values of the flow data, whether its associated flow data set is entirely in the cache; if not, storing the flow data in the cache; if so, judging the accuracy of the recognition based on the flow data and its associated flow data set, and storing them together in the relevance database;

in step 3, determining, based on the mapping values of the flow data, whether its associated flow data set is entirely in the cache comprises steps 3.1 to 3.3;

2. The management method for automatically collecting flow data according to claim 1, further comprising preprocessing the multiple pieces of flow data before they are acquired, the preprocessing comprising data cleaning and normalization.

3. The management method for automatically collecting flow data according to claim 1, wherein in step 2, recognition and segmentation of the flow data is implemented by means of regular expressions.

4. A management system for automatically collecting flow data, applying the method of any one of claims 1 to 3, characterized in that the system comprises: a data acquisition module, a logic judgment module, a first data storage module, and a second data storage module;

5. A terminal comprising a processor and a storage medium, characterized in that:

the storage medium is used for storing instructions;

the processor being operative according to the instructions to perform the steps of the method according to any one of claims 1-3.

6. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the steps of the method according to any of claims 1-3.