CN113435701B - Method and device for processing consumption quality information - Google Patents

Info

Publication number
CN113435701B
Authority
CN
China
Prior art keywords
data
quality information
data item
consumption quality
consumption
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110598175.7A
Other languages
Chinese (zh)
Other versions
CN113435701A (en)
Inventor
刘俊彦
王柏林
倪坪雄
魏伟力
刘甦儿
陈锦笑
钟丽红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Consumer Report Magazine Co ltd
Original Assignee
Consumer Report Magazine Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Consumer Report Magazine Co ltd
Priority to CN202110598175.7A
Publication of CN113435701A
Application granted
Publication of CN113435701B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06395 Quality analysis or management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0282 Rating or review of business operators or products

Landscapes

  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Physics & Mathematics (AREA)
  • Game Theory and Decision Science (AREA)
  • Marketing (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • Educational Administration (AREA)
  • Theoretical Computer Science (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of computers and discloses a method and a device for processing consumption quality information. The method comprises: acquiring a plurality of first table files of consumption quality information; mapping each data item name in all consumption quality information entries of each first table file into a preset second table file, and performing data cleaning to obtain normalized data item fields; and matching the normalized data item fields against a third table file and replacing them with standardized data item fields to obtain structured consumption quality information. The beneficial effects of the invention are that the large number of acquired first table files are processed in batches and their data contents are organized into normalized and standardized data item fields; the process requires no additional labor cost, improves data processing speed and efficiency, and can meet the requirements of statistical analysis of consumer goods quality information.

Description

Method and device for processing consumption quality information
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for processing consumption quality information.
Background
With the rapid development of the social economy, the emergence of abundant and varied consumer goods has brought great physical and mental enjoyment to people's lives. At the same time, quality problems that objectively exist in some consumer goods can pose potential quality and safety risks to consumers. How to supervise the quality of consumer goods effectively has become a challenge of the era. Statistical analysis of consumer goods quality information helps in grasping the overall quality of consumer goods across regions and fields so that an effective supervision strategy can be formulated; it is an important means of quality supervision.
Statistical analysis of consumption quality information first requires acquiring that information. At present, however, such data is huge in quantity, scattered in distribution, stored in different ways, and to some degree irregular. How to manage and use such massive data efficiently is a challenge.
Consumption quality information includes a large amount of unstructured text, and no good method for processing it currently exists. Existing unstructured text is usually cleaned and preprocessed with a large amount of manual labor in order to extract key information, which is then converted into structured database records for subsequent operations. This approach is not only very inefficient but also consumes considerable human resources and cost. It cannot meet the needs of statistics and analysis of consumption quality information, and further improvement is needed.
Disclosure of Invention
The purpose of the invention is to provide a method and a device for processing consumption quality information efficiently, which improve data processing speed and efficiency, reduce labor cost, and meet the requirements of statistical analysis of consumer goods quality information.
In order to achieve the above object, the present invention provides a method for processing consumption quality information, comprising:
acquiring a plurality of first table files of consumption quality information, wherein one piece of consumption quality information corresponds to one or more first table files; each first table file comprises a plurality of consumption quality information entries, and each consumption quality information entry comprises a plurality of data item names;
mapping, according to a preset first mapping rule, each data item name in all consumption quality information entries of each first table file in turn to a preset second table file, wherein the second table file comprises a plurality of data item fields corresponding to the data item names, and the first mapping rule includes a mapping relationship between data item names and data item fields;
cleaning the data item fields mapped into the second table file according to a preset data cleaning rule to obtain normalized data item fields;
and replacing the normalized data item fields in the second table file with standardized data item fields according to a preset third table file to obtain the structured consumption quality information, wherein the third table file comprises the standardized data item fields corresponding to the normalized data item fields.
Further, mapping each data item name in all consumption quality information entries of each first table file in turn to a preset second table file according to a preset first mapping rule specifically includes:
if the data item name can be mapped to a data item field, recording the data item name into the corresponding data item field of the second table file;
and if the data item name cannot be mapped to a data item field, generating a mapping relationship between the data item name and a data item field according to a text similarity algorithm and a text classification technology, and updating the generated mapping relationship into the first mapping rule.
Further, after obtaining the structured consumption quality information, the processing method further includes:
dividing the consumption quality information entries that have the same core data item field into the same data group, to obtain a plurality of data groups with different core data item fields; traversing all consumption quality information entries in each data group and judging whether the secondary core data item fields of the entries in the same data group are the same; if the secondary core data item fields of several entries are the same, retaining one of those entries and marking the remaining ones as discarded; if the secondary core data item fields differ, retaining all of the entries; thereby obtaining a fourth table file.
Further, after obtaining the fourth table file, the processing method further includes:
performing data mining on the fourth table file, specifically:
the standardized data item fields in the structured consumption quality information comprise: enterprise name and product name.
Mining the registered address of the enterprise and the standard administrative division to which it belongs according to the enterprise name, and recording the mined registered address and standard administrative division into the fourth table file; and classifying the product according to the product name so that the product name corresponds to a standardized consumer goods classification, and recording the corresponding standardized consumer goods classification name into the fourth table file.
Further, the method further comprises: and splitting the fourth table file subjected to data mining to obtain a plurality of related sub-table files.
Further, acquiring the first table files of consumption quality information specifically includes:
crawling a plurality of consumption quality spot check announcements issued by supervision departments at all levels; deduplicating the crawled announcements; after deduplication, extracting the table data from the body text and the attachments of each announcement; and storing the extracted table data in a plurality of first table files, with one announcement as the unit, thereby obtaining a plurality of first table files storing consumption quality information.
The invention also discloses a device for processing the consumption quality information, which comprises: the device comprises a data acquisition module, a first mapping module, a data cleaning module and a structuring module.
The data acquisition module is used for acquiring a plurality of first table files of consumption quality information, wherein one piece of consumption quality information corresponds to one or more first table files; each first table file comprises a plurality of consumption quality information entries, and each consumption quality information entry comprises a plurality of data item names.
The first mapping module is used for sequentially mapping each data item name in all the consumption quality information items of each first table file into a preset second table file according to a preset first mapping rule, wherein the second table file has a plurality of data item fields corresponding to the data item names; the first mapping rule includes a mapping relationship between a data item name and a data item field.
The data cleaning module is used for cleaning the data item fields mapped into the second table file according to a preset data cleaning rule to obtain the normalized data item fields.
The structuring module is used for replacing the normalized data item fields in the second table file with standardized data item fields according to a preset third table file to obtain the structured consumption quality information, wherein the third table file comprises the standardized data item fields corresponding to the normalized data item fields.
Further, the processing device further comprises: a data deduplication module;
the data deduplication module is used for, after the structured consumption quality information is obtained, dividing the consumption quality information entries that have the same core data item field into the same data group, to obtain a plurality of data groups with different core data item fields; traversing all consumption quality information entries in each data group and judging whether the secondary core data item fields of the entries in the same data group are the same; if the secondary core data item fields of several entries are the same, retaining one of those entries and marking the remaining ones as discarded; if the secondary core data item fields differ, retaining all of the entries; thereby obtaining a fourth table file.
Further, the processing device further comprises: a data mining module;
the data mining module is configured to perform data mining on the fourth table file, specifically: the standardized data item fields in the structured consumption quality information comprise enterprise name and product name; the registered address of the enterprise and the standard administrative division to which it belongs are mined according to the enterprise name, and the mined registered address and standard administrative division are recorded into the fourth table file; and the product is classified according to the product name so that the product name corresponds to a standardized consumer goods classification, and the corresponding standardized consumer goods classification name is recorded into the fourth table file.
Further, the processing device further comprises: a data reconstruction module;
and the data reconstruction module is used for splitting the fourth table file subjected to data mining to obtain a plurality of sub-table files which are related to each other.
Compared with the prior art, the method and the device for processing consumption quality information have the following advantages: the large number of acquired first table files are processed in batches and their data contents are organized into normalized and standardized data item fields; the process requires no additional labor cost, improves data processing speed and efficiency, and can meet the requirements of statistical analysis of consumer goods quality information.
Drawings
Fig. 1 is a flowchart illustrating a method for processing consumption quality information according to embodiment 1 of the present invention;
Fig. 2 is a schematic structural diagram of a device for processing consumption quality information according to embodiment 2 of the present invention.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention, but are not intended to limit the scope of the invention.
Example 1:
Consumption quality information comprises a large amount of unstructured text, and the field of consumer goods quality lacks a complete workflow for structuring such data. Existing data extraction targets only structured data; unstructured data requires a large investment of manual labor, and the accuracy and efficiency of extraction are generally low.
The invention provides a complete data structuring method for consumption quality information. By formulating rules, it processes both structured and unstructured data efficiently and supports data in various storage forms. For data that the existing rules cannot handle, the system improves itself through machine learning without manual intervention, which reduces cost and improves efficiency and data accuracy.
As shown in fig. 1, a method for processing consumption quality information according to a preferred embodiment of the present invention includes:
Step S1, obtaining a plurality of first table files of consumption quality information, wherein one piece of consumption quality information corresponds to one or more first table files; each first table file comprises a plurality of consumption quality information entries, and each consumption quality information entry comprises a plurality of data item names.
Step S2, according to a preset first mapping rule, sequentially mapping each data item name in all the consumption quality information items in each first table file to a preset second table file, wherein the second table file has a plurality of data item fields corresponding to the data item names; the first mapping rule includes a mapping relationship between a data item name and a data item field.
Step S3, cleaning the data item fields mapped into the second table file according to a preset data cleaning rule to obtain normalized data item fields.
Step S4, replacing the normalized data item fields in the second table file with standardized data item fields according to a preset third table file to obtain the structured consumption quality information, wherein the third table file comprises the standardized data item fields corresponding to the normalized data item fields.
In step S1, acquiring the first table files of consumption quality information includes: crawling a plurality of consumption quality spot check announcements issued by supervision departments at all levels; deduplicating the crawled announcements; after deduplication, extracting the table data from the body text and the attachments of each announcement; and storing the extracted table data in a plurality of first table files, with one announcement as the unit, thereby obtaining a plurality of first table files storing consumption quality information.
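The deduplication and per-announcement storage in step S1 can be sketched as follows. This is a minimal illustration only: the source does not say how duplicate announcements are recognized, so the sketch assumes that two announcements are duplicates when their whitespace-normalized body text is identical. The record layout (`id`, `text`, `tables`) and both function names are hypothetical.

```python
import hashlib


def dedupe_announcements(announcements):
    """Keep the first announcement for each distinct body text.

    Assumption: duplicates are detected by hashing the whitespace-normalized
    text; the source does not specify the duplicate criterion.
    """
    seen = set()
    unique = []
    for ann in announcements:
        normalized = " ".join(ann["text"].split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(ann)
    return unique


def to_first_table_files(announcements):
    """Group extracted tables by announcement (one group per announcement)."""
    return {ann["id"]: ann["tables"] for ann in dedupe_announcements(announcements)}
```

Each key of the returned dictionary stands in for the set of first table files belonging to one announcement; a real system would write these tables to spreadsheet files.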
In an embodiment of the invention, the consumption quality information entries comprise consumer goods spot check information entries and consumer goods recall information entries. The embodiments use consumer goods spot check information entries for illustration; a person skilled in the art can select other consumer goods quality information and the corresponding information entries as needed.
In the embodiment of the invention, the first table file refers to a standardized Excel table generated by extracting the spot check data tables from the body text and attachments of a consumer goods spot check announcement; the second, third, and fourth table files refer to corresponding database tables.
Supervision departments at every level can issue consumption quality spot check announcements, but the table files in these announcements follow no common standard: different supervision departments may name the same data item differently, the number of data items may differ, and the standards for the content recorded under each data item name also differ. As a result, the table files in announcements issued by different supervision departments cannot be processed together, each announcement can only be used to evaluate product information within a limited scope, and the consumption quality information is not fully utilized.
In an embodiment of the present invention, the processing method further includes extracting management indexes such as a sampling unit, a sampling batch, a sampling class, a qualification rate, and the like in the bulletin through dependency syntax analysis.
In an embodiment of the present invention, the data item names in the consumer goods spot check information comprise one or more of the following data items:
serial number, sampling inspection classification, product name, specification model number, production enterprise name, production enterprise address, unified social credit code or organization code, production enterprise contact, production lot/date, sampling inspection result, unqualified item, unqualified type, inspection basis, unqualified item standard value, unqualified item actual measurement value, production lot, sampling quantity, sampling base number, product grade, market/platform of the inspected person, related certificate number, sampling serial number, notice number, task source/item name, inventory quantity, approval document number, whether production is permitted, inspection item, whether 3C certified, sampling inspection administrative agency, notice title, notice release date, inspected agency name, inspected agency address, inspected agency contact, inspection time, inspection report number, inspection-undertaking institution, and remarks.
In step S2, the data items of the consumer goods quality spot check information stored in the first table files extracted in step S1 need to be mapped to the data item fields of the pre-designed spot check data table. However, the data item names used in the announcements are not uniform: a name may have the same meaning as a pre-designed data item field yet differ from it partly in wording. In this step, therefore, the mapping relationship between the data item names in the first table file and the data item fields in the second table file needs to be configured, and mapping is performed according to that relationship.
If the data item name and the data item field can be mapped, recording the data item name into the corresponding data item field of the second table file;
and if the data item name cannot be mapped to a data item field, generating a mapping relationship between the data item name and a data item field according to a text similarity algorithm and a text classification technology, and updating the generated mapping relationship into the first mapping rule. Once the generated mapping relationship has been added to the first mapping rule, the same data item name can be mapped directly by the rule when it is encountered again. Because new mapping relationships are generated automatically, names that cannot be mapped are handled without manual rule configuration, which reduces labor cost.
The data item fields in the second table file include one or more of: serial number, spot check classification, product name, specification model number, production enterprise name, production enterprise address, unified social credit code or organization code, production enterprise contact, production lot/date, spot check result, unqualified item, unqualified type, inspection basis, unqualified item standard value, unqualified item measured value, production lot quantity, sampling base number, product grade, market/platform of the inspected person, related certificate number, sampling number, notice number, task source/item name, stock quantity, approval document number, whether production is permitted, inspection item, whether 3C certified, spot check administrative agency, notice title, notice release date, inspected agency name, inspected agency address, inspected agency contact person, inspected agency contact information, inspection time, inspection report number, inspection institution, and remarks.
In an embodiment of the present invention, assume there are two first table files, one containing the data item name "nominal manufacturer" and the other containing the data item name "nominal production unit name". Both can be understood to mean the manufacturer or producer, and both correspond to the data item field PRODUCTOR in the second table file of the present invention. The preset mapping rule therefore maps "nominal manufacturer" to PRODUCTOR and "nominal production unit name" to PRODUCTOR, and the contents recorded under both names are mapped into PRODUCTOR in the second table file. If another first table file contains the data item name "nominal manufacturing enterprise" and the mapping rule records no mapping relationship between "nominal manufacturing enterprise" and PRODUCTOR, the mapping relationship between them is generated according to a text similarity algorithm and a text classification technology, and the generated mapping relationship is updated into the first mapping rule.
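The PRODUCTOR example can be sketched in code as follows. This is an illustration under stated assumptions, not the patented algorithm: the source names "a text similarity algorithm and a text classification technology" without specifying them, so the sketch substitutes the character-level similarity ratio from Python's standard `difflib` module. The field name `PRODUCT_NAME`, the function name `resolve_field`, and the 0.6 threshold are hypothetical.

```python
from difflib import SequenceMatcher

# First mapping rule: data item name -> data item field.
# PRODUCTOR comes from the description; PRODUCT_NAME is a hypothetical field.
mapping_rule = {
    "nominal manufacturer": "PRODUCTOR",
    "nominal production unit name": "PRODUCTOR",
    "product name": "PRODUCT_NAME",
}


def resolve_field(item_name, rule, threshold=0.6):
    """Return the data item field for item_name, learning new mappings.

    If the name is unknown, fall back to the most similar known name and,
    when the similarity reaches the (assumed) threshold, record the new
    mapping in the rule so the next lookup is direct.
    """
    if item_name in rule:
        return rule[item_name]
    best_name, best_score = None, 0.0
    for known in rule:
        score = SequenceMatcher(None, item_name, known).ratio()
        if score > best_score:
            best_name, best_score = known, score
    if best_name is None or best_score < threshold:
        return None  # left unmapped; no sufficiently similar known name
    rule[item_name] = rule[best_name]  # update the first mapping rule
    return rule[item_name]
```

With this rule, `resolve_field("nominal manufacturing enterprise", mapping_rule)` falls back to the closest known name, "nominal manufacturer", and stores the new mapping to PRODUCTOR.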
In step S3, the data item fields in the second table file are cleaned according to a preset data cleaning rule to obtain normalized data item fields. In the acquired consumption quality information, some of the data item names contain meaningless characters that would interfere with subsequent data processing, so these characters need to be removed by the data cleaning rule.
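A data cleaning rule of this kind might look like the sketch below. The source does not list which characters count as meaningless, so the choices here (zero-width characters, repeated whitespace, leading or trailing `*` and `#`, and placeholder cells such as `/`) are assumptions for illustration, as is the function name `clean_value`.

```python
import re

# Assumed placeholder values that carry no information.
_PLACEHOLDERS = {"", "/", "-", "--", "N/A"}


def clean_value(value):
    """Normalize one cell of text; return None for empty placeholders."""
    text = re.sub(r"[\u200b\ufeff]", "", value)  # drop zero-width space / BOM
    text = re.sub(r"\s+", " ", text).strip()     # collapse runs of whitespace
    text = text.strip("*# ")                     # drop decorative markers
    return None if text in _PLACEHOLDERS else text
```

For example, `clean_value("  *Electric\u200b   kettle* ")` yields `"Electric kettle"`, and a cell containing only `/` becomes `None`.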
In step S4, the normalized data item fields in the second table file are replaced with standardized data item fields according to a preset third table file to obtain the structured consumption quality information, where the third table file includes the standardized data item fields corresponding to the normalized data item fields.
In the embodiment of the present invention, the preset third table file is used to process those data item fields that have standard values. For example, the name of a spot check administrative agency has a nationally standardized name, but the name appearing in a spot check announcement may be a short name. If such names are not standardized, subsequent statistics by administrative agency become very inconvenient and may even be wrong. The data item fields recorded in the second table file are therefore matched against the standardized data item fields recorded in the third table file, and non-standardized data item fields are converted into the corresponding standardized data item fields according to the matching result. Specifically, the third table file stores the correspondence between the short names of administrative agencies and their nationally unified standard names; after matching, each short name is replaced with the nationally unified standard name.
For data item fields that cannot be standardized according to the third table file, the corresponding standardized data item fields can be inferred through a text classification technology, and the third table file is supplemented accordingly. In this way, data item fields that have standard names are converted into standardized data item fields through the third table file in this step.
Besides the name of the spot check administrative agency, the unqualified items among the data item fields also have standardized forms. Converting the non-standardized data item fields in the second table file into standardized data item fields facilitates further processing of the second table file.
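The replacement performed in step S4 amounts to a lookup against the third table file, which can be sketched as below. The field name `SPOT_CHECK_AGENCY`, the agency names, and the function name `standardize` are all hypothetical placeholders; the sketch only collects values it cannot standardize and does not implement the text classification fallback that the description mentions.

```python
# Third table file, modeled as: field -> {non-standard value: standard value}.
# All names below are illustrative placeholders, not real registry data.
THIRD_TABLE = {
    "SPOT_CHECK_AGENCY": {
        "City A Market Bureau": "City A Administration for Market Regulation",
        "City A AMR": "City A Administration for Market Regulation",
    },
}


def standardize(record, table):
    """Replace known non-standard values; report the ones left unmatched."""
    result = dict(record)
    unmatched = []
    for field, lookup in table.items():
        value = result.get(field)
        if value is None:
            continue
        if value in lookup:
            result[field] = lookup[value]
        elif value not in lookup.values():
            # Neither a known short name nor already a standard name.
            unmatched.append((field, value))
    return result, unmatched
```

The `unmatched` list marks the values for which the third table file would need to be supplemented.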
After steps S1 to S4, the consumption quality information acquired from supervision departments at all levels has been turned into structured data with a uniform format: in the digitized table, the data item names have become uniform data item fields, and the contents of those fields are standardized. When a user extracts information by a given data item field, extraction errors or omissions caused by multiple expressions of the same content are therefore avoided, which ensures the accuracy and completeness of data processing. The process can also run automatically, with no manual adjustment or modification of data content, which saves a large amount of labor cost, improves data processing speed and efficiency, and meets the requirements of statistical analysis of consumer goods quality information.
After step S4, that is, after obtaining the structured consumption quality information, the processing method further includes:
Step S5, dividing the consumption quality information entries that have the same core data item field into the same data group, to obtain a plurality of data groups with different core data item fields; traversing all consumption quality information entries in each data group and judging whether the secondary core data item fields of the entries in the same data group are the same; if the secondary core data item fields of several entries are the same, retaining one of those entries and marking the remaining ones as discarded; if the secondary core data item fields differ, retaining all of the entries; thereby obtaining a fourth table file.
Since the consumer product quality spot check data referred to herein comes from the network or from third-party collaborators, the same data may be collected more than once, so the data must be deduplicated to ensure its accuracy. Deduplication reduces the repetition rate and improves data quality.
It must be emphasized that the embodiment of the present invention differs from prior-art deduplication schemes, in which an entire row of data is compared with another entire row, i.e., one consumer product quality information entry is compared in full with another. Although duplicates can be found in this way, the comparison is inefficient and slow for very large data sets, which hinders the processing of consumption quality information. By contrast, the embodiment first groups the entries by the core data item field and then compares only the secondary core data item fields within each group.
In addition, because deduplication in the embodiment of the present invention is performed after the data has been structured, duplicate entries can be detected more conveniently during comparison, which improves deduplication efficiency.
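The grouping and comparison described in step S5 can be sketched as follows. This is a minimal illustration, not the patented implementation: the field names `enterprise`, `product`, and `batch`, the choice of which fields are "core" and "secondary core", and the `status` column are all assumptions made for the example.

```python
from collections import defaultdict

def deduplicate(entries, core_field, secondary_fields):
    """Group entries by the core field, then compare only the secondary
    core fields inside each group. The first entry of each distinct
    combination is kept; later ones are marked as discarded."""
    groups = defaultdict(list)
    for entry in entries:
        groups[entry[core_field]].append(entry)

    result = []
    for group in groups.values():
        seen = set()
        for entry in group:
            key = tuple(entry[f] for f in secondary_fields)
            status = "discarded" if key in seen else "kept"
            seen.add(key)
            # Entries are marked rather than deleted, as in step S5.
            result.append({**entry, "status": status})
    return result

# Hypothetical structured entries (field names are illustrative only).
entries = [
    {"enterprise": "A Co.", "product": "kettle", "batch": "2021-01"},
    {"enterprise": "A Co.", "product": "kettle", "batch": "2021-01"},
    {"enterprise": "A Co.", "product": "toaster", "batch": "2021-02"},
    {"enterprise": "B Co.", "product": "kettle", "batch": "2021-01"},
]
fourth_table = deduplicate(entries, "enterprise", ["product", "batch"])
```

Because each entry is compared only against entries in its own group, and only on the secondary core fields, the work per entry is a single set lookup rather than a row-by-row comparison against the whole data set.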
After step S5, that is, after obtaining the fourth table file, the processing method further includes:
step S6, performing data mining on the fourth table file, specifically:
the standardized data item fields in the structured consumption quality information include: enterprise name and product name; mining the registered address of the enterprise and the standard administrative division to which the enterprise belongs according to the enterprise name, and recording the mined registered address and standard administrative division into the fourth table file; and classifying the product according to the product name so that the product name corresponds to a standardized consumer goods classification, and recording the corresponding standardized consumer goods classification name into the fourth table file.
To meet the statistical requirements of subsequent data applications, such as statistics by administrative division, statistics by product classification, statistics on unqualified items, or other requirements, this step mines the registered address of each enterprise and the standard administrative division to which it belongs from the name of the production enterprise in the spot check data. It also classifies the product names in the spot check data by a text classification technique, assigning each product name to the standardized consumer goods classification issued by the state, and extracts structured data on the unqualified items of each product according to the unqualified-item configuration rules of its classification.
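The enrichment performed in step S6 can be illustrated with a short sketch. Everything named here is hypothetical: `ENTERPRISE_REGISTRY` stands in for whatever enterprise registration source is actually queried, and the keyword lookup in `classify_product` is a deliberately simple substitute for the text classification technique mentioned in the description.

```python
# Hypothetical lookup data; a real system would query a registry service
# and use a trained text classifier instead of keyword rules.
ENTERPRISE_REGISTRY = {
    "A Co.": {
        "registered_address": "1 Example Road, Guangzhou",
        "administrative_division": "Guangdong Province / Guangzhou",
    },
}
CATEGORY_KEYWORDS = {
    "household appliances": ["kettle", "toaster"],
    "children's products": ["toy", "stroller"],
}

def classify_product(name):
    """Return the first category whose keyword occurs in the product name."""
    lowered = name.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return None  # left unclassified for later review

def enrich(entry):
    """Add registered address, administrative division and category."""
    info = ENTERPRISE_REGISTRY.get(entry["enterprise"], {})
    return {
        **entry,
        "registered_address": info.get("registered_address"),
        "administrative_division": info.get("administrative_division"),
        "category": classify_product(entry["product"]),
    }

enriched = enrich({"enterprise": "A Co.", "product": "Electric Kettle"})
unknown = enrich({"enterprise": "Z Co.", "product": "Desk Lamp"})
```

Entries whose enterprise or product cannot be resolved keep `None` in the new columns, so they remain in the table and can be filtered out or corrected later.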
After step S6, the processing method further includes:
step S7, splitting the fourth table file subjected to data mining to obtain a plurality of sub-table files associated with each other.
Data analysis draws on the contents of the fourth table file, so the file can be split into sub-table files according to the analysis requirements, with association relationships established among the sub-table files to facilitate subsequent data analysis.
In addition, the fourth table file and the split sub-table files are stored in the database for subsequent use.
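One way to picture step S7 is to split the enriched table into an enterprise sub-table and an inspection sub-table linked by a key, and to store both in a relational database. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the table names, column names, and sample rows are assumptions for illustration and are not taken from the patent.

```python
import sqlite3

def split_and_store(fourth_table, conn):
    """Split rows into two associated sub-tables and store them."""
    conn.execute(
        "CREATE TABLE enterprise ("
        "id INTEGER PRIMARY KEY, name TEXT UNIQUE, division TEXT)"
    )
    conn.execute(
        "CREATE TABLE inspection ("
        "id INTEGER PRIMARY KEY, "
        "enterprise_id INTEGER REFERENCES enterprise(id), "
        "product TEXT, result TEXT)"
    )
    for row in fourth_table:
        # Each enterprise is stored once; repeated names are ignored.
        conn.execute(
            "INSERT OR IGNORE INTO enterprise (name, division) VALUES (?, ?)",
            (row["enterprise"], row["division"]),
        )
        (enterprise_id,) = conn.execute(
            "SELECT id FROM enterprise WHERE name = ?", (row["enterprise"],)
        ).fetchone()
        conn.execute(
            "INSERT INTO inspection (enterprise_id, product, result) "
            "VALUES (?, ?, ?)",
            (enterprise_id, row["product"], row["result"]),
        )
    conn.commit()

rows = [
    {"enterprise": "A Co.", "division": "Guangdong", "product": "kettle", "result": "unqualified"},
    {"enterprise": "A Co.", "division": "Guangdong", "product": "toaster", "result": "qualified"},
    {"enterprise": "B Co.", "division": "Zhejiang", "product": "kettle", "result": "unqualified"},
]
conn = sqlite3.connect(":memory:")
split_and_store(rows, conn)

# Example analysis: unqualified inspections per administrative division.
unqualified_by_division = conn.execute(
    "SELECT e.division, COUNT(*) FROM inspection i "
    "JOIN enterprise e ON e.id = i.enterprise_id "
    "WHERE i.result = 'unqualified' "
    "GROUP BY e.division ORDER BY e.division"
).fetchall()
```

The `enterprise_id` column is the association between the two sub-tables, which is what lets the final query combine them for statistics by administrative division.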
Example 2:
As shown in Fig. 2, this embodiment discloses an apparatus for processing consumption quality information, comprising: a data acquisition module 101, a first mapping module 102, a data cleaning module 103, and a structuring module 104.
The data acquisition module 101 is configured to acquire a plurality of first table files of consumption quality information, where one piece of consumption quality information corresponds to one or more first table files; each first table file comprises a plurality of consumption quality information entries, and each consumption quality information entry comprises a plurality of data item names.
The first mapping module 102 is configured to sequentially map each data item name in all the consumption quality information items of each first table file into a preset second table file according to a preset first mapping rule, where the second table file has a plurality of data item fields corresponding to the data item names; the first mapping rule includes a mapping relationship between a data item name and a data item field.
The data cleaning module 103 is configured to clean the data item fields mapped into the second table file according to a preset data cleaning rule to obtain normalized data item fields.
The structuring module 104 is configured to replace the normalized data item fields in the second table file with standardized data item fields according to a preset third table file to obtain the structured consumption quality information, wherein the third table file includes the standardized data item fields corresponding to the normalized data item fields.
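The way modules 102 to 104 hand data to one another can be sketched as three small functions. This is only an illustration under stated assumptions: `FIRST_MAPPING_RULE` and `THIRD_TABLE` are hypothetical stand-ins for the preset first mapping rule and the preset third table file, and the cleaning rule shown (Unicode NFKC normalization plus whitespace collapsing) is one plausible rule, not the one the patent prescribes.

```python
import unicodedata

# Hypothetical first mapping rule: source data item name -> data item field.
FIRST_MAPPING_RULE = {
    "Manufacturer": "enterprise",
    "Producer": "enterprise",
    "Product": "product",
    "Result": "result",
}
# Hypothetical third table: per field, normalized value -> standardized value.
THIRD_TABLE = {
    "result": {"pass": "qualified", "ok": "qualified", "fail": "unqualified"},
}

def map_entry(raw):
    """First mapping module: rename known data item names, drop the rest."""
    return {
        FIRST_MAPPING_RULE[name]: value
        for name, value in raw.items()
        if name in FIRST_MAPPING_RULE
    }

def clean(value):
    """Data cleaning module: fold full-width characters, collapse spaces."""
    return " ".join(unicodedata.normalize("NFKC", value).split())

def standardize(field, value):
    """Structuring module: replace a normalized value by its standard form."""
    return THIRD_TABLE.get(field, {}).get(value.lower(), value)

def structure(raw):
    mapped = map_entry(raw)
    return {field: standardize(field, clean(v)) for field, v in mapped.items()}

structured = structure(
    {"Producer": "  A  Co. ", "Product": "Kettle", "Result": "PASS", "Remarks": "x"}
)
```

Keeping the mapping rule and the third table as data rather than code mirrors the description's emphasis on preset rules: new expressions of the same content are handled by extending the tables, not by changing the functions.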
The processing apparatus further comprises: a data deduplication module 105;
the data deduplication module 105 is configured to, after the structured consumption quality information is obtained, divide the consumption quality information entries that have the same core data item field into the same data group, so as to obtain a plurality of data groups with different core data item fields; traverse all the consumption quality information entries in each data group and judge whether the secondary core data item fields of the entries in the same data group are the same; if the secondary core data item fields of a plurality of consumption quality information entries are the same, retain one of those entries and mark the remaining entries as discarded; and if the secondary core data item fields are different, retain all the consumption quality information entries; thereby obtaining a fourth table file.
The processing apparatus further comprises: a data mining module 106;
the data mining module 106 is configured to perform data mining on the fourth table file, specifically: the standardized data item fields in the structured consumption quality information include: enterprise name and product name; mining the registered address of the enterprise and the standard administrative division to which the enterprise belongs according to the enterprise name, and recording the mined registered address and standard administrative division into the fourth table file; and classifying the product according to the product name so that the product name corresponds to a standardized consumer goods classification, and recording the corresponding standardized consumer goods classification name into the fourth table file.
The processing apparatus further comprises: a data reconstruction module 107;
the data reconstruction module 107 is configured to split the fourth table file that has undergone data mining to obtain a plurality of sub-table files associated with each other.
In summary, the embodiments of the present invention provide a method and an apparatus for processing consumption quality information, which have the following advantages:
(1) Based on a pipeline processing mechanism, a large amount of text information can be processed in a short time. Once the processing rules for each step are defined, input text files can be continuously structured and stored in relational database tables in a pipelined manner, which increases the processing speed of consumption quality information, extracts its core content, reduces the required storage space, and lowers cost.
(2) The method processes consumption quality information automatically, structures it, and stores it in relational database tables, thereby supporting fast queries, batch modification, and various statistics, and meeting the requirements of big data processing.
(3) When the manually formulated extraction rules and mapping rules are incomplete, the rules are refined by statistical machine learning methods, and segments from which information cannot be extracted automatically are written to a log for analyzing and correcting the rules, so the method has a degree of fault tolerance and robustness.
(4) The method can process files in various formats, including txt, Word, Excel, HTML, and PDF, and requires no additional preprocessing before the data is processed, which makes it convenient to use.
(5) The method can run on the mainstream operating systems, including Linux, macOS, and Windows, and is therefore portable.
The above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and substitutions without departing from the technical principle of the present invention, and these modifications and substitutions should also be regarded as falling within the protection scope of the present invention.

Claims (8)

1. A method for processing consumption quality information, comprising:
acquiring a plurality of first table files of consumption quality information, wherein one piece of consumption product quality information corresponds to one or more first table files; each first table file comprises a plurality of consumption quality information items, and each consumption quality information item comprises a plurality of data item names;
according to a preset first mapping rule, sequentially mapping each data item name in all consumption quality information items in each first table file to a preset second table file, wherein the second table file comprises a plurality of data item fields corresponding to the data item names; the first mapping rule comprises a mapping relation between a data item name and a data item field;
cleaning the data item fields mapped into the second table file according to a preset data cleaning rule to obtain normalized data item fields;
replacing the normalized data item fields in the second table file with standardized data item fields according to a preset third table file to obtain structured consumption quality information, wherein the third table file comprises the standardized data item fields corresponding to the normalized data item fields;
the mapping of each data item name in all consumption quality information items in each first table file to a preset second table file in sequence according to a preset first mapping rule specifically comprises:
if the data item name and the data item field can be mapped, recording the data item name into the corresponding data item field of the second table file;
if the data item name and the data item field can not be mapped, generating a mapping relation between the data item name and the data table field according to a text similarity algorithm and a text classification technology, and updating the generated mapping relation into a first mapping rule;
the acquiring of the plurality of first table files of consumption quality information specifically comprises:
crawling a plurality of consumption quality spot check notices issued by supervision departments at all levels; deduplicating the crawled consumption quality spot check notices; after the deduplication, extracting the table data in the body text and in the attachments of each consumption quality spot check notice; and storing the extracted table data into a plurality of first table files with each consumption quality spot check notice as a unit, thereby obtaining a plurality of first table files storing consumption quality information.
2. The method of claim 1, wherein after obtaining the structured consumption quality information, the method further comprises:
extracting the second table file according to a preset core data item field, and dividing the consumption quality information entries that have the same core data item field into the same data group to obtain a plurality of data groups with different core data item fields; traversing all the consumption quality information entries in each data group and judging whether the secondary core data item fields of the entries in the same data group are the same; if the secondary core data item fields of a plurality of consumption quality information entries are the same, retaining one of those entries and marking the remaining entries as discarded; and if the secondary core data item fields are different, retaining all the consumption quality information entries; thereby obtaining a fourth table file.
3. The method of claim 2, wherein after obtaining the fourth table file, the method further comprises:
performing data mining on the fourth table file, specifically:
the standardized data item fields in the structured consumption quality information comprise: enterprise name and product name;
mining the registered address of the enterprise and the standard administrative division to which the enterprise belongs according to the enterprise name, and recording the mined registered address and standard administrative division into the fourth table file;
and classifying the product according to the product name so that the product name corresponds to a standardized consumer goods classification, and recording the corresponding standardized consumer goods classification name into the fourth table file.
4. A method of processing consumption quality information according to claim 3, wherein the method further comprises:
and splitting the fourth table file subjected to data mining to obtain a plurality of related sub-table files.
5. An apparatus for processing consumption quality information, comprising: a data acquisition module, a first mapping module, a data cleaning module and a structuring module;
the data acquisition module is used for acquiring a plurality of first table files of consumption quality information, wherein one piece of consumption product quality information corresponds to one or more first table files; each first table file comprises a plurality of consumption quality information items, and each consumption quality information item comprises a plurality of data item names;
the first mapping module is used for sequentially mapping each data item name in all the consumption quality information items of each first table file into a preset second table file according to a preset first mapping rule, wherein the second table file has a plurality of data item fields corresponding to the data item names; the first mapping rule comprises a mapping relation between a data item name and a data item field;
the data cleaning module is used for cleaning the data item fields mapped into the second table file according to a preset data cleaning rule to obtain normalized data item fields;
the structuring module is used for replacing the normalized data item fields in the second table file with standardized data item fields according to a preset third table file to obtain structured consumption quality information, wherein the third table file comprises the standardized data item fields corresponding to the normalized data item fields;
the mapping of each data item name in all consumption quality information items in each first table file to a preset second table file in sequence according to a preset first mapping rule specifically comprises:
if the data item name and the data item field can be mapped, recording the data item name into the corresponding data item field of the second table file;
if the data item name and the data item field can not be mapped, generating a mapping relation between the data item name and the data table field according to a text similarity algorithm and a text classification technology, and updating the generated mapping relation into a first mapping rule;
the acquiring of the plurality of first table files of consumption quality information specifically comprises:
crawling a plurality of consumption quality spot check notices issued by supervision departments at all levels; deduplicating the crawled consumption quality spot check notices; after the deduplication, extracting the table data in the body text and in the attachments of each consumption quality spot check notice; and storing the extracted table data into a plurality of first table files with each consumption quality spot check notice as a unit, thereby obtaining a plurality of first table files storing consumption quality information.
6. The apparatus for processing consumption quality information as claimed in claim 5, further comprising: a data deduplication module;
the data deduplication module is used for, after the structured consumption quality information is obtained, dividing the consumption quality information entries that have the same core data item field into the same data group to obtain a plurality of data groups with different core data item fields; traversing all the consumption quality information entries in each data group and judging whether the secondary core data item fields of the entries in the same data group are the same; if the secondary core data item fields of a plurality of consumption quality information entries are the same, retaining one of those entries and marking the remaining entries as discarded; and if the secondary core data item fields are different, retaining all the consumption quality information entries; thereby obtaining a fourth table file.
7. The apparatus for processing consumption quality information as claimed in claim 5, further comprising: a data mining module;
the data mining module is used for performing data mining on the fourth table file, specifically: the standardized data item fields in the structured consumption quality information comprise: enterprise name and product name; mining the registered address of the enterprise and the standard administrative division to which the enterprise belongs according to the enterprise name, and recording the mined registered address and standard administrative division into the fourth table file; and classifying the product according to the product name so that the product name corresponds to a standardized consumer goods classification, and recording the corresponding standardized consumer goods classification name into the fourth table file.
8. The apparatus for processing consumption quality information according to claim 5, wherein said apparatus further comprises: a data reconstruction module;
and the data reconstruction module is used for splitting the fourth table file subjected to data mining to obtain a plurality of sub-table files which are related to each other.
CN202110598175.7A 2021-05-28 2021-05-28 Method and device for processing consumption quality information Active CN113435701B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110598175.7A CN113435701B (en) 2021-05-28 2021-05-28 Method and device for processing consumption quality information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110598175.7A CN113435701B (en) 2021-05-28 2021-05-28 Method and device for processing consumption quality information

Publications (2)

Publication Number Publication Date
CN113435701A CN113435701A (en) 2021-09-24
CN113435701B true CN113435701B (en) 2022-05-31

Family

ID=77803272

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110598175.7A Active CN113435701B (en) 2021-05-28 2021-05-28 Method and device for processing consumption quality information

Country Status (1)

Country Link
CN (1) CN113435701B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104346377A (en) * 2013-07-31 2015-02-11 克拉玛依红有软件有限责任公司 Method for integrating and exchanging data on basis of unique identification
CN109086260A (en) * 2018-08-29 2018-12-25 中国标准化研究院 Food data processing method and processing device
CN110502516A (en) * 2019-08-22 2019-11-26 深圳前海环融联易信息科技服务有限公司 List data analytic method, device, computer equipment and storage medium
CN110659287A (en) * 2019-09-11 2020-01-07 北京亚信数据有限公司 Method for processing field names of table and computing equipment
CN111061833A (en) * 2019-12-10 2020-04-24 北京明略软件***有限公司 Data processing method and device, electronic equipment and computer readable storage medium
CN111898905A (en) * 2020-07-28 2020-11-06 霍翔 Quality spot check management method and device, computer equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8595245B2 (en) * 2006-07-26 2013-11-26 Xerox Corporation Reference resolution for text enrichment and normalization in mining mixed data
US20160155156A1 (en) * 2012-03-13 2016-06-02 American Express Travel Related Services Company, Inc. Systems and Methods for Presenting Real Time Customized Data to a User
CN111353286A (en) * 2020-03-06 2020-06-30 苏宁云计算有限公司 Table file processing method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN113435701A (en) 2021-09-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant