CN113641815B

CN113641815B - Data screening method and device based on big data and electronic equipment

Info

Publication number: CN113641815B
Application number: CN202110845992.8A
Authority: CN
Inventors: 吴博; 朱昕宇; 刘宜帆; 周春辉
Original assignee: Wuhan University of Technology WUT
Current assignee: Wuhan University of Technology WUT
Priority date: 2021-07-26
Filing date: 2021-07-26
Publication date: 2023-06-13
Anticipated expiration: 2041-07-26
Also published as: CN113641815A

Abstract

The invention relates to a data screening method, a device, electronic equipment and a computer readable storage medium based on big data, wherein the method comprises the steps of obtaining screening conditions, screening data to be screened according to the screening conditions, and obtaining a data document corresponding to the screening conditions; extracting the data document by using an inverted index to obtain user screening information, and cleaning the user screening information to obtain cleaned user screening information; and prioritizing the cleaned user screening information according to the query condition of the pre-calibrated user to obtain a prioritized result. The data screening method based on big data can simplify the data screening operation process and improve the data screening efficiency.

Description

Data screening method and device based on big data and electronic equipment

Technical Field

The present invention relates to the field of internet technologies, and in particular, to a data screening method and apparatus based on big data, an electronic device, and a computer readable storage medium.

Background

With the development of the big data environment, data accumulate rapidly. Analyzing the value contained in massive data and screening out the valuable data is very important, so data screening occupies a vital position in the whole data processing flow. For example, in the e-commerce field, data documents containing condition, date, age and product specification information are screened. The purpose of data screening is to improve the usability of previously collected and stored data and to facilitate later data analysis.

In the prior art, one way of screening data is to export the data to an Excel spreadsheet and then screen it manually. Another disclosed method performs customized screening configuration on the required configuration information in a web page and generates a corresponding data screening template for data screening, so that repeated manual exporting and screening are not needed.

However, these data screening methods do not sort the acquired data, so the acquired data are redundant and difficult to inspect visually. As a result, the data screening operation process is complex and the data screening efficiency is low.

Disclosure of Invention

In view of the foregoing, it is necessary to provide a data screening method, apparatus, electronic device and computer readable storage medium based on big data, so as to solve the problems of complex data screening operation process and low data screening efficiency in the big data document in the e-commerce field in the prior art.

In order to solve the above problems, the present invention provides a data screening method based on big data, including:

acquiring screening conditions, and screening data to be screened according to the screening conditions to obtain data documents corresponding to the screening conditions;

extracting the data document by using an inverted index to obtain user screening information, and cleaning the user screening information to obtain cleaned user screening information;

and prioritizing the cleaned user screening information according to the query condition of the pre-calibrated user to obtain a prioritized result.

Further, obtaining screening conditions, and screening data to be screened according to the screening conditions, specifically including:

and taking at least one of the characters, the character strings and the hypertext links as initial screening conditions, taking at least one of the conditions, the date, the age and the product specification information as rescreening conditions, and screening the data to be screened according to the initial screening conditions and the rescreening conditions.

Further, the extracting the data document by using the inverted index to obtain user screening information specifically includes:

numbering the data documents, dividing the interior of each data document into a plurality of words, forming a corresponding relation between each word and the number of the data document by using an inverted index, and obtaining user screening information by retrieving and extracting the data documents.

Further, using the inverted index to form a corresponding relation between each word and the data document numbers, and obtaining the user screening information by retrieving and extracting the data documents, specifically comprises the following steps:

inverted indexing is carried out on the data document by adopting a hash table structure so as to obtain the corresponding relation from the word to all the data document numbers containing the word;

the screening conditions are disassembled into a plurality of words, and the numbers of all data documents containing the words corresponding to the screening conditions are inquired according to the corresponding relation;

and taking intersections of the numbers of all the queried data documents to obtain user screening information.

Further, the inverted indexing of the data document by adopting the hash table structure to obtain the corresponding relation from the word to all the data document numbers containing the word specifically includes:

sequentially accessing each data document, obtaining the hash table value of each word in the data document, and inserting the data document number into that value, so as to form a corresponding relation from each word to the numbers of all data documents containing the word.
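
The hash-table inverted index and the intersection query described above can be sketched as follows. This is a minimal illustration, not the implementation of the invention: the Python `dict` stands in for the hash table, and the whitespace tokenizer, the function names `build_inverted_index` and `query`, and the sample documents are assumptions introduced here.

```python
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words (a simplistic stand-in for a real word segmenter)."""
    return text.lower().split()


def build_inverted_index(documents):
    """Map each word to the set of document numbers that contain it.

    Documents are numbered by their position in the list, starting from 0.
    """
    index = defaultdict(set)  # hash table: word -> set of document numbers
    for doc_number, text in enumerate(documents):
        for word in tokenize(text):
            # Look up the word's entry and insert the current document number.
            index[word].add(doc_number)
    return index


def query(index, condition):
    """Return the numbers of the documents that contain every word of the condition."""
    words = tokenize(condition)
    if not words:
        return set()
    postings = [index.get(word, set()) for word in words]
    # Intersect the document-number sets of all words in the condition.
    return set.intersection(*postings)


# Hypothetical sample data.
documents = [
    "red cotton shirt size m",
    "blue cotton shirt size l",
    "red wool scarf",
]
index = build_inverted_index(documents)
print(sorted(query(index, "red shirt")))
print(sorted(query(index, "cotton shirt")))
```

Storing a set of document numbers per word keeps the intersection step a direct set operation; a production index would more likely keep sorted posting lists.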

Further, the prioritizing the cleaned user screening information according to a query condition of a pre-calibrated user specifically includes:

acquiring the relative frequency with which the query condition of the pre-calibrated user occurs in the data documents in the cleaned user screening information, and prioritizing the cleaned user screening information according to the relative frequency.

Further, obtaining a relative frequency of occurrence of a query condition of a pre-calibrated user in a data document in the cleaned user screening information, and prioritizing the cleaned user screening information according to the relative frequency, specifically including:

acquiring, according to the query condition of the pre-calibrated user and the sorting characteristic function, the relative frequency with which the query condition of the pre-calibrated user occurs in the data documents in the cleaned user screening information, and prioritizing the cleaned user screening information according to the relative frequency;

the sorting characteristic function is

Wherein q is the query condition of the pre-calibrated user; d is a data document in the cleaned user screening information; f_i(d, q) is the relative frequency with which the i-th word of the pre-calibrated user's query condition q occurs in the data document d; f_t(t_i, d) is the relative frequency with which the word t_i occurs in the data document d; V is the number of data documents selected according to the query condition of the pre-calibrated user; N is the number of data documents selected from the cleaned user screening information as training data; and N_t is the total number of data documents in the cleaned user screening information.

The invention also provides a data screening device based on big data, which comprises a data screening module, an information extraction module and a prioritization module;

the data screening module is used for acquiring screening conditions, screening the data to be screened according to the screening conditions and acquiring data documents corresponding to the screening conditions;

the information extraction module is used for extracting the data document by using the inverted index to obtain user screening information, and cleaning the user screening information to obtain cleaned user screening information;

and the prioritization module is used for prioritizing the cleaned user screening information according to the query condition of the pre-calibrated user to obtain a prioritization result.

The invention also provides electronic equipment, which comprises a processor and a memory, wherein the memory is stored with a computer program, and when the computer program is executed by the processor, the data screening method based on big data according to any one of the technical schemes is realized.

The invention also provides a computer readable storage medium, on which a computer program is stored, characterized in that the data screening method based on big data according to any of the above-mentioned technical solutions is implemented when the computer program is executed by a processor.

The beneficial effects of the above embodiments are as follows. The data screening method based on big data provided by the invention screens the data to be screened according to screening conditions input by a user and obtains the related data documents. In the specific implementation, the data documents are numbered by means of the inverted index, which makes them easy to inspect visually; the data documents are extracted by Boolean search to obtain user screening information; the user screening information is cleaned, which serves to verify it; and the cleaned user screening information is prioritized and a chart is generated with a chart library. This simplifies the data screening operation process and improves the data screening efficiency.

Drawings

Fig. 1 is a schematic diagram of an application scenario of a data screening device based on big data provided by the invention;

FIG. 2 is a flow chart of an embodiment of a data screening method based on big data according to the present invention;

FIG. 3 is a schematic diagram of a Boolean search method according to an embodiment of the present invention;

FIG. 4 is a block diagram illustrating an embodiment of a big data based data screening apparatus according to the present invention;

FIG. 5 is a block diagram of an embodiment of an electronic device according to the present invention.

Detailed Description

Preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings, which form a part hereof, and together with the description serve to explain the principles of the invention, and are not intended to limit the scope of the invention.

The invention provides a data screening method and device based on big data, electronic equipment and a computer readable storage medium, and the data screening method and device based on big data, the electronic equipment and the computer readable storage medium are respectively described in detail below.

FIG. 1 is a schematic diagram of an application scenario of the data screening device based on big data provided by the present invention. The system may include a server 100 in which the data screening device based on big data is integrated, such as the server in FIG. 1.

The server 100 in the embodiment of the present invention is mainly used for:

In the embodiment of the present invention, the server 100 may be an independent server, or may be a server network or a server cluster formed by servers. For example, the server 100 described in the embodiment of the present invention includes, but is not limited to, a computer, a network host, a single network server, a set of multiple network servers, or a cloud server formed by a plurality of servers. The cloud server consists of a large number of computers or network servers based on cloud computing.

It will be appreciated that the terminal 200 used in embodiments of the present invention may be a device that includes both receive and transmit hardware, i.e., a device having receive and transmit hardware capable of performing bi-directional communications over a bi-directional communication link. Such a device may include: a cellular or other communication device having a single-line display or a multi-line display or a cellular or other communication device without a multi-line display. The specific terminal 200 may be a desktop computer, a portable computer, a web server, a palm computer (Personal Digital Assistant, PDA), a mobile phone, a tablet computer, a wireless terminal device, a communication device, an embedded device, etc., and the embodiment is not limited to the type of the terminal 200.

It will be appreciated by those skilled in the art that the application environment shown in FIG. 1 is merely one application scenario of the present invention and does not limit its application scenarios. Other application environments may include more or fewer terminals than shown in FIG. 1; for example, only 2 terminals are shown in FIG. 1, and the data screening device based on big data may also include one or more other terminals, which is not limited herein.

In addition, referring to fig. 1, the big data based data screening apparatus may further include a memory 200 for storing data such as condition, date, age, and product specification.

It should be noted that the scenario of the data screening device based on big data shown in FIG. 1 is only an example. The device and scenario described in the embodiments of the present invention are intended to explain the technical solution of the embodiments more clearly and do not limit the technical solution provided by the embodiments. Those skilled in the art will appreciate that, as the data screening device based on big data evolves and new service scenarios appear, the technical solution provided by the embodiments of the present invention is equally applicable to similar technical problems.

The embodiment of the invention provides a data screening method based on big data, which is shown in a flow chart in fig. 2, and comprises the following steps:

step S201, obtaining screening conditions, and screening data to be screened according to the screening conditions to obtain data documents corresponding to the screening conditions;

step S202, extracting the data document by using an inverted index to obtain user screening information, and cleaning the user screening information to obtain cleaned user screening information;

step S203, prioritizing the cleaned user screening information according to the query condition of the pre-calibrated user to obtain a prioritized result.

In a specific embodiment, the user screening information is cleaned as follows: deletion and filling operations are performed on the data documents in the user screening information by using a Bayesian formula or a decision tree method, and the resulting data documents are transferred into a data warehouse in the data document format of the data warehouse, yielding the cleaned user screening information.
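
The deletion and filling operations can be illustrated with a small sketch. The text does not specify how the Bayesian formula or the decision tree is applied, so the code below swaps in a much simpler rule: a missing value is filled with the most frequent value observed among records of the same group. The function name `clean_records`, the field names and the sample records are all hypothetical.

```python
from collections import Counter, defaultdict


def clean_records(records, required_field, fill_field, group_field):
    """Delete records lacking `required_field`, then fill missing `fill_field` values.

    A missing value is replaced by the most frequent value seen among records
    sharing the same `group_field` value. This conditional-frequency rule is an
    assumed stand-in for the Bayesian or decision-tree estimate.
    """
    # Deletion step: drop records whose required field is missing (copy the rest).
    kept = [dict(r) for r in records if r.get(required_field) is not None]

    # Count the observed fill_field values per group.
    counts = defaultdict(Counter)
    for r in kept:
        if r.get(fill_field) is not None:
            counts[r.get(group_field)][r[fill_field]] += 1

    # Filling step: use the most frequent value in the record's group, if any.
    for r in kept:
        if r.get(fill_field) is None:
            group_counts = counts.get(r.get(group_field))
            if group_counts:
                r[fill_field] = group_counts.most_common(1)[0][0]
    return kept


# Hypothetical sample records.
records = [
    {"id": 1, "category": "shirt", "condition": "new"},
    {"id": 2, "category": "shirt", "condition": "new"},
    {"id": 3, "category": "shirt", "condition": None},
    {"id": None, "category": "scarf", "condition": "used"},
]
cleaned = clean_records(records, "id", "condition", "category")
print(cleaned)
```

The input records are copied before filling, so the raw data remain available for comparison with the cleaned result.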

As a preferred embodiment, obtaining screening conditions, and screening data to be screened according to the screening conditions specifically includes:

In a specific embodiment, a user input module and a condition screening module respectively provide an input text box and a screening condition box. The user enters keywords, key character strings or key hypertext links in the input text box as initial screening conditions and checks screening conditions in the screening condition box as rescreening conditions, and the data are screened accordingly to obtain the data documents corresponding to the screening conditions.
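
A minimal sketch of this two-stage screening is given below, assuming each data document is a dictionary with a free-text field and a few structured fields. The function name `screen`, the field names and the sample documents are illustrative assumptions rather than part of the described embodiment.

```python
def screen(documents, keyword, rescreen):
    """Two-stage screening.

    Stage 1 keeps documents whose text contains `keyword` (initial condition).
    Stage 2 keeps documents whose fields equal every checked value in
    `rescreen` (rescreening conditions).
    """
    initial = [d for d in documents if keyword.lower() in d["text"].lower()]
    return [
        d for d in initial
        if all(d.get(field) == value for field, value in rescreen.items())
    ]


# Hypothetical sample documents.
documents = [
    {"text": "Cotton shirt", "condition": "new", "date": "2021-07-01"},
    {"text": "Cotton shirt", "condition": "used", "date": "2021-07-01"},
    {"text": "Wool scarf", "condition": "new", "date": "2021-07-02"},
]
result = screen(documents, "shirt", {"condition": "new"})
print(result)
```

Passing an empty `rescreen` dictionary corresponds to a user who enters a keyword but checks no box, in which case only the initial condition applies.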

As a preferred embodiment, the extracting the data document by using the inverted index to obtain the user filtering information specifically includes:

In a specific embodiment, each data document is numbered and arranged in sequence from 0, each data document is divided into a plurality of words by using a grammar analyzer, and each word and the data document number form a corresponding relation by using an inverted index, so that the acquired data document is prevented from being excessively redundant, and visual observation is facilitated.

As a preferred embodiment, the method for obtaining the user screening information by retrieving and extracting the data document by using the inverted index to form a corresponding relation between each word and the number of the data document specifically comprises the following steps:

The data document is extracted by a boolean search method.

As a preferred embodiment, the inverted indexing of the data document by using the hash table structure to obtain the correspondence from the word to all the data document numbers containing the word specifically includes:

In a specific embodiment, a schematic diagram of the Boolean search method is shown in FIG. 3, in which a Boolean search is performed over the data documents satisfying screening condition A and the data documents satisfying screening condition B. All data documents are numbered in advance, for example in the format category + time node + number. Each time the user checks a screening condition, the user query is taken as a screening node and the data to be screened are acquired; the data to be screened are screened to obtain the numbers of all data documents that satisfy the screening conditions checked by the user; and the intersection of the screened data document numbers is taken to obtain the user screening information.
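
The Boolean intersection over pre-numbered documents might look like the sketch below. The exact layout of the category + time node + number format is not given, so the hyphen separator and four-digit padding in `make_number` are assumptions, as are the sample match sets for conditions A and B.

```python
def make_number(category, time_node, serial):
    """Compose a document number as category + time node + serial number.

    The separator and zero padding are illustrative choices.
    """
    return f"{category}-{time_node}-{serial:04d}"


# Numbers of the documents matching each checked screening condition (sample data).
matches_a = {make_number("shirt", "20210701", n) for n in (1, 2, 3)}
matches_b = {make_number("shirt", "20210701", n) for n in (2, 3, 4)}

# Boolean AND: keep only the numbers present in both result sets.
user_screening_info = sorted(matches_a & matches_b)
print(user_screening_info)
```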

As a preferred embodiment, the prioritizing the cleaned user screening information according to a query condition of a pre-calibrated user specifically includes:

It should be noted that, in this embodiment, prioritizing the cleaned user screening information according to the query condition of the pre-calibrated user avoids presenting too many unordered data documents, which would hinder visual inspection, and thereby improves the efficiency of screening the cleaned user screening information.

As a preferred embodiment, obtaining a relative frequency of occurrence of a query condition of a pre-calibrated user in a data document in the cleaned user screening information, and prioritizing the cleaned user screening information according to the relative frequency, specifically including:

the sorting characteristic function is

In a specific embodiment, the query condition of the pre-calibrated user is q = {w_1, w_2, ..., w_s}, and the data documents in the cleaned user screening information are taken as candidate documents d = {d_1, d_2, ..., d_k}, where q is the screening condition and d is the set of screened data documents. A score is calculated for q and each d_k: S_k = score(q, d_k). f_i(d, q) is defined as the relative frequency with which the i-th word of the pre-calibrated user's query condition q occurs in the data document d, expressed as

Wherein f_i(d, q) is the sorting characteristic function, understood as the weight of the i-th word of query condition q in candidate document d; f_t(t_i, d) is the relative frequency with which the word t_i occurs in candidate document d; V is the number of data documents selected according to the query condition of the pre-calibrated user; N is the number of data documents selected from the cleaned user screening information as training data; N_t is the total number of data documents in the cleaned user screening information; and the denominator is a normalization factor. The value of f_i(d, q) is recorded as the score S_k, and the higher the score, the higher the ranking.
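
To illustrate ranking by relative frequency, the sketch below scores each candidate document by summing, over the query words, the fraction of the document's words equal to that query word. This is an assumed simplification and not the sorting characteristic function of the invention, whose formula (including the normalization by V, N and N_t) is not reproduced in this text. The names `relative_frequency`, `score` and `prioritize` and the sample documents are hypothetical.

```python
def relative_frequency(word, doc_words):
    """Fraction of the words in a document that equal `word`."""
    if not doc_words:
        return 0.0
    return doc_words.count(word) / len(doc_words)


def score(query_words, doc_words):
    """Sum of the relative frequencies of the query words in one document."""
    return sum(relative_frequency(w, doc_words) for w in query_words)


def prioritize(query, documents):
    """Return document numbers ordered from highest to lowest score."""
    query_words = query.lower().split()
    scored = [
        (score(query_words, text.lower().split()), number)
        for number, text in enumerate(documents)
    ]
    # Sort by descending score; ties are broken by ascending document number.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [number for _, number in scored]


# Hypothetical sample documents.
documents = [
    "red shirt and blue shirt and green shirt",
    "red shirt",
    "blue scarf",
]
ranking = prioritize("red shirt", documents)
print(ranking)
```

Because the frequencies are relative to document length, the short document consisting only of the query words ranks above the longer one that mentions "shirt" more often in absolute terms.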

In another embodiment, the template generation module generates a chart of the prioritized results. This includes obtaining the screening conditions and the screened data documents, forming a mapping relation between the screening conditions and the screened data documents, and calling a chart library to generate the chart from the mapping relation.
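
The mapping step can be sketched as follows. No particular chart library is named in the text, so a plain-text bar renderer stands in for the chart library call here; `build_mapping`, `render_bar_chart` and the sample data are assumptions made for illustration.

```python
def build_mapping(conditions, documents):
    """Map each screening condition to the documents whose text contains it."""
    return {
        c: [d for d in documents if c.lower() in d.lower()]
        for c in conditions
    }


def render_bar_chart(mapping):
    """Render one text bar per condition; bar length is the number of matching documents."""
    width = max((len(c) for c in mapping), default=0)
    lines = []
    for condition, docs in mapping.items():
        lines.append(f"{condition.ljust(width)} | {'#' * len(docs)} {len(docs)}")
    return "\n".join(lines)


# Hypothetical sample data.
documents = ["red shirt", "blue shirt", "red scarf"]
mapping = build_mapping(["red", "shirt", "wool"], documents)
chart = render_bar_chart(mapping)
print(chart)
```

In a real deployment the `mapping` dictionary would be handed to whichever chart library the system uses instead of `render_bar_chart`.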

An embodiment of the invention provides a data screening device based on big data. As shown in FIG. 4, the data screening device based on big data comprises a data screening module 401, an information extraction module 402 and a prioritization module 403;

the data screening module 401 is configured to obtain a screening condition, and screen the data to be screened according to the screening condition to obtain a data document corresponding to the screening condition;

the information extraction module 402 is configured to extract the data document by using an inverted index to obtain user screening information, and clean the user screening information to obtain cleaned user screening information;

the prioritization module 403 is configured to prioritize the cleaned user screening information according to a query condition of a pre-calibrated user, so as to obtain a prioritization result.

The data screening module 401 includes a user input module, a condition screening module, and an information uploading module. The user input module and the condition screening module are respectively configured to provide an input text box and a screening condition box; the user enters a keyword, a key character string or a key hypertext link in the input text box as an initial screening condition and checks screening conditions in the screening condition box as rescreening conditions, thereby producing the screening conditions. The information uploading module is configured to upload the obtained screening conditions to a cloud database;

the information extraction module 402 performs information extraction operation in a cloud database, including obtaining screening condition information uploaded by an information uploading module, performing inverted index extraction on a data document according to the screening condition information, performing boolean search on the extracted data document to obtain user screening information, and cleaning the user screening information to obtain cleaned user screening information;

the prioritization module 403 includes a prioritization module, a template generation module, and an information acquisition module, where the information acquisition module is configured to acquire data documents in the cleaned user screening information, the prioritization module is configured to prioritize the data documents in the cleaned user screening information according to a query condition of a pre-calibrated user, and the template generation module is configured to generate a chart for a prioritization result, and complete data screening.
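
The way the three modules cooperate can be sketched as a small pipeline. The class names mirror the modules 401 to 403, but every method body is a deliberately simplified placeholder (substring matching, duplicate removal as the cleaning step, and relative word frequency as the score), and the cloud database and chart generation are omitted. None of these details come from the embodiment itself.

```python
class DataScreeningModule:
    """Counterpart of module 401: keeps documents containing the keyword."""

    def screen(self, documents, keyword):
        return [d for d in documents if keyword.lower() in d.lower()]


class InformationExtractionModule:
    """Counterpart of module 402: Boolean AND over query words, then cleaning."""

    def extract_and_clean(self, documents, query):
        words = query.lower().split()
        hits = [d for d in documents if all(w in d.lower().split() for w in words)]
        # Cleaning placeholder: trim whitespace and drop duplicates, keeping order.
        return list(dict.fromkeys(d.strip() for d in hits))


class PrioritizationModule:
    """Counterpart of module 403: orders documents by relative query-word frequency."""

    def prioritize(self, documents, query):
        words = query.lower().split()

        def score(d):
            tokens = d.lower().split()
            return sum(tokens.count(w) for w in words) / len(tokens) if tokens else 0.0

        return sorted(documents, key=score, reverse=True)


class DataScreeningDevice:
    """Wires the three modules together in the order screening, extraction, prioritization."""

    def __init__(self):
        self.screening = DataScreeningModule()
        self.extraction = InformationExtractionModule()
        self.prioritization = PrioritizationModule()

    def run(self, documents, keyword, query):
        screened = self.screening.screen(documents, keyword)
        cleaned = self.extraction.extract_and_clean(screened, query)
        return self.prioritization.prioritize(cleaned, query)


# Hypothetical sample data.
documents = ["red shirt", "red shirt red", "red shirt", "blue shirt", "red scarf"]
result = DataScreeningDevice().run(documents, keyword="shirt", query="red")
print(result)
```

Keeping each stage behind its own class makes it possible to replace one placeholder, for example the cleaning rule, without touching the other two.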

As shown in fig. 5, the present invention further provides an electronic device, which may be a mobile terminal, a desktop computer, a notebook computer, a palm computer, a server, or other computing devices. The electronic device includes a processor 10, a memory 20, and a display 30.

The memory 20 may in some embodiments be an internal storage unit of a computer device, such as a hard disk or memory of a computer device. The memory 20 may also be an external storage device of the computer device in other embodiments, such as a plug-in hard disk provided on the computer device, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), etc. Further, the memory 20 may also include both internal storage units and external storage devices of the computer device. The memory 20 is used for storing application software installed on the computer device and various types of data, such as program codes for installing the computer device. The memory 20 may also be used to temporarily store data that has been output or is to be output. In one embodiment, the memory 20 stores a big data based data screening method program 40, and the big data based data screening method program 40 may be executed by the processor 10, thereby implementing the big data based data screening method according to the embodiments of the present invention.

The processor 10 may in some embodiments be a central processing unit (Central Processing Unit, CPU), microprocessor or other data processing chip for running program code or processing data stored in the memory 20, e.g. performing big data based data screening methods etc.

The display 30 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like in some embodiments. The display 30 is used for displaying information on the computer device and for displaying a visual user interface. The components 10-30 of the computer device communicate with each other via a system bus.

In one embodiment, the following steps are implemented when the processor 10 executes the big data based data screening method program 40 in the memory 20:

The present embodiment also provides a computer-readable storage medium having stored thereon a data screening method program based on big data, which when executed by a processor, implements the steps of:

Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored on a non-transitory computer readable storage medium, which, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.

The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention.

Claims

1. A data screening method based on big data, comprising:

acquiring, according to the query condition of the pre-calibrated user and the sorting characteristic function, the relative frequency with which the query condition of the pre-calibrated user occurs in the data documents in the cleaned user screening information, and prioritizing the cleaned user screening information according to the relative frequency to obtain a prioritized result;

the sorting characteristic function is

2. The big data-based data screening method according to claim 1, wherein screening conditions are obtained, and screening of the data to be screened is performed according to the screening conditions, specifically comprising:

3. The big data based data filtering method according to claim 1, wherein the extracting the data document by using the inverted index to obtain the user filtering information specifically includes:

4. The data filtering method based on big data according to claim 3, wherein the using the inverted index to make each word form a corresponding relation with the number of the data document, and obtaining the user filtering information by retrieving and extracting the data document, specifically comprises:

5. The method for screening data based on big data according to claim 4, wherein the inverted indexing of the data document using the hash table structure to obtain the correspondence from the word to all the data document numbers containing the word specifically comprises:

sequentially accessing each data document, obtaining the hash table value of each word in the data document, and inserting the data document number into that value, so as to determine the corresponding relation from each word to the numbers of all data documents containing the word.

6. The data screening device based on big data is characterized by comprising a data screening module, an information extraction module and a priority ordering module;

the data screening module is used for acquiring screening conditions, screening data to be screened according to the screening conditions and acquiring data documents corresponding to the screening conditions;

the prioritization module is used for acquiring, according to the query condition of the pre-calibrated user and the sorting characteristic function, the relative frequency with which the query condition of the pre-calibrated user occurs in the data documents in the cleaned user screening information, and prioritizing the cleaned user screening information according to the relative frequency to obtain a prioritization result;

the sorting characteristic function is

7. An electronic device comprising a processor and a memory, wherein the memory has stored thereon a computer program which, when executed by the processor, implements the big data based data screening method according to any of claims 1-5.

8. A computer readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the big data based data screening method according to any of claims 1-5.