CN113641815B - Data screening method and device based on big data and electronic equipment - Google Patents

Data screening method and device based on big data and electronic equipment

Info

Publication number
CN113641815B
Authority
CN
China
Prior art keywords
data
screening
user
information
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110845992.8A
Other languages
Chinese (zh)
Other versions
CN113641815A (en
Inventor
吴博
朱昕宇
刘宜帆
周春辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Technology WUT
Priority to CN202110845992.8A
Publication of CN113641815A
Application granted
Publication of CN113641815B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/319 Inverted lists
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a data screening method, a device, electronic equipment and a computer readable storage medium based on big data. The method comprises: obtaining screening conditions, and screening the data to be screened according to the screening conditions to obtain the data documents corresponding to the screening conditions; extracting the data documents by using an inverted index to obtain user screening information, and cleaning the user screening information to obtain cleaned user screening information; and prioritizing the cleaned user screening information according to the query condition of the pre-calibrated user to obtain a prioritized result. The data screening method based on big data can simplify the data screening operation process and improve the data screening efficiency.

Description

Data screening method and device based on big data and electronic equipment
Technical Field
The present invention relates to the field of internet technologies, and in particular, to a data screening method and apparatus based on big data, an electronic device, and a computer readable storage medium.
Background
With the development of the big data environment, data accumulates rapidly. Analyzing the value contained in massive data and screening out the valuable data is very important, so data screening occupies a vital position in the whole data processing flow. For example, in the e-commerce field, data documents containing condition, date, age and product specification information are screened. The purpose of data screening is to improve the usability of previously collected and stored data and to facilitate later data analysis.
In the prior art, one way of implementing data screening is to export the data to an Excel spreadsheet and then screen it manually. Another disclosed method performs customized screening configuration on the required configuration information in a web page and generates a corresponding data screening template for data screening, so that repeated manual exporting and screening are not needed.
However, these data screening methods do not sort the acquired data, so the acquired data is redundant and difficult to inspect visually. As a result, the data screening operation process is complex and the data screening efficiency is low.
Disclosure of Invention
In view of the foregoing, it is necessary to provide a data screening method, apparatus, electronic device and computer readable storage medium based on big data, so as to solve the problems of complex data screening operation process and low data screening efficiency in the big data document in the e-commerce field in the prior art.
In order to solve the above problems, the present invention provides a data screening method based on big data, including:
acquiring screening conditions, and screening data to be screened according to the screening conditions to obtain data documents corresponding to the screening conditions;
extracting the data document by using an inverted index to obtain user screening information, and cleaning the user screening information to obtain cleaned user screening information;
and prioritizing the cleaned user screening information according to the query condition of the pre-calibrated user to obtain a prioritized result.
Further, obtaining screening conditions, and screening data to be screened according to the screening conditions, specifically including:
and taking at least one of the characters, the character strings and the hypertext links as initial screening conditions, taking at least one of the conditions, the date, the age and the product specification information as rescreening conditions, and screening the data to be screened according to the initial screening conditions and the rescreening conditions.
Further, the extracting the data document by using the inverted index to obtain user screening information specifically includes:
numbering the data documents, dividing the interior of each data document into a plurality of words, forming a corresponding relation between each word and the number of the data document by using an inverted index, and obtaining user screening information by retrieving and extracting the data documents.
Further, the method for obtaining the user screening information by retrieving and extracting the data document by using the inverted index to enable each word to form a corresponding relation with the data document number specifically comprises the following steps:
inverted indexing is carried out on the data document by adopting a hash table structure so as to obtain the corresponding relation from the word to all the data document numbers containing the word;
the screening conditions are disassembled into a plurality of words, and the numbers of all data documents containing the words corresponding to the screening conditions are inquired according to the corresponding relation;
and taking intersections of the numbers of all the queried data documents to obtain user screening information.
Further, the inverted indexing of the data document by adopting the hash table structure to obtain the corresponding relation from the word to all the data document numbers containing the word specifically includes:
and sequentially accessing each data document, looking up each word of the data document in a hash table to obtain the value stored for that word, and inserting the data document number into that value, so as to form a corresponding relation from the word to all the data document numbers containing the word.
Further, the prioritizing the cleaned user screening information according to a query condition of a pre-calibrated user specifically includes:
and acquiring the relative frequency with which the query condition of the pre-calibrated user occurs in the data documents in the cleaned user screening information, and prioritizing the cleaned user screening information according to the relative frequency.
Further, obtaining a relative frequency of occurrence of a query condition of a pre-calibrated user in a data document in the cleaned user screening information, and prioritizing the cleaned user screening information according to the relative frequency, specifically including:
acquiring, according to the query condition of the pre-calibrated user and the sorting characteristic function, the relative frequency with which the query condition of the pre-calibrated user occurs in the data documents in the cleaned user screening information, and prioritizing the cleaned user screening information according to the relative frequency;
the sorting characteristic function is
[Equation image BDA0003180576470000031: not reproduced in this text]
wherein q is the query condition of the pre-calibrated user; d is a data document in the cleaned user screening information; f_i(d, q) is the relative frequency with which the i-th word of the query condition q of the pre-calibrated user occurs in the data document d; f_t(t_i, d) is the relative frequency with which the word t_i occurs in the data document d; V is the number of data documents selected according to the query condition of the pre-calibrated user; N is the number of data documents selected from the cleaned user screening information as training data documents; and N_t is the total number of data documents in the cleaned user screening information.
The invention also provides a data screening device based on big data, which comprises a data screening module, an information extraction module and a priority ordering module;
the data screening module is used for acquiring screening conditions, screening the data to be screened according to the screening conditions and acquiring data documents corresponding to the screening conditions;
the information extraction module is used for extracting the data document by using the inverted index to obtain user screening information, and cleaning the user screening information to obtain cleaned user screening information;
and the prioritization module is used for prioritizing the cleaned user screening information according to the query condition of the pre-calibrated user to obtain a prioritization result.
The invention also provides electronic equipment, which comprises a processor and a memory, wherein the memory is stored with a computer program, and when the computer program is executed by the processor, the data screening method based on big data according to any one of the technical schemes is realized.
The invention also provides a computer readable storage medium, on which a computer program is stored, characterized in that the data screening method based on big data according to any of the above-mentioned technical solutions is implemented when the computer program is executed by a processor.
The beneficial effects of the above embodiments are as follows. The data screening method based on big data provided by the invention screens the data to be screened according to the screening conditions input by the user to obtain the related data documents. In the specific implementation process, the data documents are numbered and indexed by using the inverted index, which makes them easier to inspect visually; the data documents are extracted by Boolean search to obtain the user screening information; the user screening information is cleaned, which serves to check it; and the cleaned user screening information is prioritized and a chart is generated by using a chart library. The data screening operation process can thereby be simplified and the data screening efficiency improved.
Drawings
Fig. 1 is a schematic diagram of an application scenario of a data screening device based on big data provided by the invention;
FIG. 2 is a flow chart of an embodiment of a data screening method based on big data according to the present invention;
FIG. 3 is a schematic diagram of a Boolean search method according to an embodiment of the present invention;
FIG. 4 is a block diagram illustrating an embodiment of a big data based data screening apparatus according to the present invention;
fig. 5 is a block diagram of an embodiment of an electronic device according to the present invention.
Detailed Description
Preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings, which form a part hereof, and together with the description serve to explain the principles of the invention, and are not intended to limit the scope of the invention.
The invention provides a data screening method and device based on big data, electronic equipment and a computer readable storage medium, and the data screening method and device based on big data, the electronic equipment and the computer readable storage medium are respectively described in detail below.
Fig. 1 is a schematic diagram of an application scenario of a data filtering device based on big data provided by the present invention, where the system may include a server 100, and the server 100 is integrated with the data filtering device based on big data, such as the server in fig. 1.
The server 100 in the embodiment of the present invention is mainly used for:
acquiring screening conditions, and screening data to be screened according to the screening conditions to obtain data documents corresponding to the screening conditions;
extracting the data document by using an inverted index to obtain user screening information, and cleaning the user screening information to obtain cleaned user screening information;
and prioritizing the cleaned user screening information according to the query condition of the pre-calibrated user to obtain a prioritized result.
In the embodiment of the present invention, the server 100 may be an independent server, or may be a server network or a server cluster formed by servers, for example, the server 100 described in the embodiment of the present invention includes, but is not limited to, a computer, a network host, a single network server, a plurality of network server sets, or a cloud server formed by a plurality of servers. Wherein the Cloud server is composed of a large number of computers or web servers based on Cloud Computing (Cloud Computing).
It will be appreciated that the terminal 200 used in embodiments of the present invention may be a device that includes both receive and transmit hardware, i.e., a device having receive and transmit hardware capable of performing bi-directional communications over a bi-directional communication link. Such a device may include: a cellular or other communication device having a single-line display or a multi-line display or a cellular or other communication device without a multi-line display. The specific terminal 200 may be a desktop computer, a portable computer, a web server, a palm computer (Personal Digital Assistant, PDA), a mobile phone, a tablet computer, a wireless terminal device, a communication device, an embedded device, etc., and the embodiment is not limited to the type of the terminal 200.
It will be appreciated by those skilled in the art that the application environment shown in fig. 1 is merely an application scenario of the present invention, and is not limited to the application scenario of the present invention, and other application environments may also include more or fewer terminals than those shown in fig. 1, for example, only 2 terminals are shown in fig. 1, and it will be appreciated that the data screening apparatus based on big data may also include one or more other terminals, which is not limited herein.
In addition, referring to fig. 1, the big data based data screening apparatus may further include a memory 200 for storing data such as condition, date, age, and product specification.
It should be noted that, the schematic view of the scenario of the big data based data filtering device shown in fig. 1 is only an example, and the big data based data filtering device and scenario described in the embodiments of the present invention are for more clearly describing the technical solution of the embodiments of the present invention, and do not constitute a limitation to the technical solution provided by the embodiments of the present invention, and as the big data based data filtering device evolves and new service scenarios appear, those skilled in the art can know that the technical solution provided by the embodiments of the present invention is equally applicable to similar technical problems.
The embodiment of the invention provides a data screening method based on big data, which is shown in a flow chart in fig. 2, and comprises the following steps:
step S201, obtaining screening conditions, and screening data to be screened according to the screening conditions to obtain data documents corresponding to the screening conditions;
step S202, extracting the data document by using an inverted index to obtain user screening information, and cleaning the user screening information to obtain cleaned user screening information;
and step S203, prioritizing the cleaned user screening information according to the query condition of the pre-calibrated user to obtain a prioritized result.
In a specific embodiment, the user screening information is cleaned as follows: deletion and filling operations are performed on the data documents in the user screening information by using a Bayesian formula or a decision tree method, and the resulting data documents are loaded into a data warehouse in the data document format of that warehouse, yielding the cleaned user screening information.
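The deletion and filling operations can be illustrated with a minimal sketch. The text names a Bayesian formula or a decision tree but gives no details, so the code below substitutes the simplest frequency-based estimate: a missing field is filled with the value most often seen for that field within the same category. The field names, the `category` key and the `max_missing` threshold are assumptions made for illustration only.

```python
from collections import Counter, defaultdict

# Hypothetical field names; the text only lists condition, date, age and
# product specification as examples of stored information.
REQUIRED_FIELDS = ("condition", "date", "age", "spec")


def clean_records(records, max_missing=2):
    """Delete records with too many missing fields, then fill the rest."""
    # Deletion: drop records that are missing more than max_missing fields.
    kept = [
        r for r in records
        if sum(r.get(f) is None for f in REQUIRED_FIELDS) <= max_missing
    ]

    # Count observed values per (category, field). This is a crude
    # stand-in for estimating the most probable value given the category.
    counts = defaultdict(Counter)
    for r in kept:
        for f in REQUIRED_FIELDS:
            if r.get(f) is not None:
                counts[(r["category"], f)][r[f]] += 1

    # Filling: replace each missing field with the most frequent value.
    cleaned = []
    for r in kept:
        filled = dict(r)
        for f in REQUIRED_FIELDS:
            if filled.get(f) is None:
                seen = counts[(r["category"], f)]
                if seen:
                    filled[f] = seen.most_common(1)[0][0]
        cleaned.append(filled)
    return cleaned
```

A decision tree or a proper Bayesian estimate would replace the `most_common` lookup; the delete-then-fill structure would stay the same.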
As a preferred embodiment, obtaining screening conditions, and screening data to be screened according to the screening conditions specifically includes:
and taking at least one of the characters, the character strings and the hypertext links as initial screening conditions, taking at least one of the conditions, the date, the age and the product specification information as rescreening conditions, and screening the data to be screened according to the initial screening conditions and the rescreening conditions.
In a specific embodiment, a user input module and a condition screening module provide a user input text box and a screening condition box, respectively. The user enters keywords, key character strings or key hypertext links in the input text box as the initial screening conditions, and checks screening conditions in the screening condition box as the rescreening conditions; the data is then screened to obtain the data documents corresponding to the screening conditions.
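The two-stage screening described above might be sketched as follows. The `DataDocument` class, its field names and the exact-match semantics of the rescreening step are assumptions for illustration; the text does not specify how checked conditions are compared.

```python
from dataclasses import dataclass


@dataclass
class DataDocument:
    text: str        # free text matched by the initial screening condition
    condition: str   # the remaining fields are the rescreening attributes
    date: str
    age: int
    spec: str


def initial_screen(documents, keyword):
    """Keep documents whose text contains the keyword, string or link."""
    return [d for d in documents if keyword in d.text]


def rescreen(documents, **checked):
    """Keep documents whose fields equal every checked rescreening value."""
    return [
        d for d in documents
        if all(getattr(d, field) == value for field, value in checked.items())
    ]


def screen(documents, keyword, **checked):
    """Apply the initial screening condition, then the rescreening conditions."""
    return rescreen(initial_screen(documents, keyword), **checked)
```

For example, `screen(docs, "phone", condition="new")` would first keep the documents mentioning "phone" and then keep only those whose `condition` field is "new".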
As a preferred embodiment, the extracting the data document by using the inverted index to obtain the user filtering information specifically includes:
numbering the data documents, dividing the interior of each data document into a plurality of words, forming a corresponding relation between each word and the number of the data document by using an inverted index, and obtaining user screening information by retrieving and extracting the data documents.
In a specific embodiment, each data document is numbered and arranged in sequence from 0, each data document is divided into a plurality of words by using a grammar analyzer, and each word and the data document number form a corresponding relation by using an inverted index, so that the acquired data document is prevented from being excessively redundant, and visual observation is facilitated.
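As a rough illustration of the numbering and word-splitting step, the sketch below numbers documents from 0 and splits each one into lowercase words. A regular expression stands in for the grammar analyzer, which the text does not describe; Chinese text would need a proper word segmenter instead.

```python
import re


def number_documents(texts):
    """Assign consecutive numbers starting from 0, in input order."""
    return dict(enumerate(texts))


def tokenize(text):
    """Split a document into lowercase words (stand-in for a grammar analyzer)."""
    return re.findall(r"\w+", text.lower())
```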
As a preferred embodiment, the method for obtaining the user screening information by retrieving and extracting the data document by using the inverted index to form a corresponding relation between each word and the number of the data document specifically comprises the following steps:
inverted indexing is carried out on the data document by adopting a hash table structure so as to obtain the corresponding relation from the word to all the data document numbers containing the word;
the screening conditions are disassembled into a plurality of words, and the numbers of all data documents containing the words corresponding to the screening conditions are inquired according to the corresponding relation;
and taking intersections of the numbers of all the queried data documents to obtain user screening information.
The data document is extracted by a boolean search method.
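The query step above amounts to a Boolean AND over posting sets. A minimal sketch, assuming the index is already available as a dictionary from word to a set of data document numbers (the function and parameter names are illustrative):

```python
def boolean_and_search(index, query_words):
    """Return the numbers of documents that contain every query word."""
    # A word missing from the index matches no document at all.
    postings = [index.get(word, set()) for word in query_words]
    if not postings:
        return set()
    # Intersect starting from the smallest set to keep intermediates small.
    postings.sort(key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result
```

With `index = {"new": {0, 1, 3}, "phone": {1, 2, 3}}`, the query `["new", "phone"]` yields `{1, 3}`.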
As a preferred embodiment, the inverted indexing of the data document by using the hash table structure to obtain the correspondence from the word to all the data document numbers containing the word specifically includes:
and sequentially accessing each data document, looking up each word of the data document in a hash table to obtain the value stored for that word, and inserting the data document number into that value, so as to form a corresponding relation from the word to all the data document numbers containing the word.
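Building the index in this way can be sketched with a Python dictionary, which is itself a hash table. The sketch assumes documents arrive as a mapping from number to text and splits on whitespace for brevity; both choices are illustrative.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the set of document numbers that contain it."""
    index = defaultdict(set)  # hash table: word -> set of document numbers
    for doc_number, text in documents.items():  # visit documents in order
        for word in text.lower().split():
            # Look up the word's entry and insert the document number.
            index[word].add(doc_number)
    return dict(index)
```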
In a specific embodiment, a schematic diagram of the Boolean search method is shown in fig. 3, in which a Boolean search is performed over the data documents containing screening condition A and the data documents containing screening condition B. All data documents are numbered in advance, for example in the format category + time node + number. Each time the user checks a screening condition, the user query is taken as a screening node and the data to be screened is acquired; the data to be screened is screened to obtain the numbers of all data documents that satisfy the screening conditions checked by the user; and the intersection of the numbers of the screened data documents is taken to obtain the user screening information.
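One possible reading of the "category + time node + number" format is sketched below. The separator, the date layout and the zero-padded width are assumptions; the text only names the three components.

```python
from datetime import date


def make_doc_number(category, time_node, seq):
    """Compose a document number from category, time node and sequence number."""
    # Hypothetical layout: <category>-<YYYYMMDD>-<six-digit sequence>.
    return f"{category}-{time_node:%Y%m%d}-{seq:06d}"
```

Under these assumptions, `make_doc_number("phone", date(2021, 7, 26), 42)` gives `"phone-20210726-000042"`.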
As a preferred embodiment, the prioritizing the cleaned user screening information according to a query condition of a pre-calibrated user specifically includes:
and acquiring the relative frequency with which the query condition of the pre-calibrated user occurs in the data documents in the cleaned user screening information, and prioritizing the cleaned user screening information according to the relative frequency.
It should be noted that, in this embodiment, prioritizing the cleaned user screening information according to the query condition of the pre-calibrated user prevents an excessive number of data documents in the cleaned user screening information from hindering visual inspection, and thus improves the efficiency of screening the cleaned user screening information.
As a preferred embodiment, obtaining a relative frequency of occurrence of a query condition of a pre-calibrated user in a data document in the cleaned user screening information, and prioritizing the cleaned user screening information according to the relative frequency, specifically including:
acquiring, according to the query condition of the pre-calibrated user and the sorting characteristic function, the relative frequency with which the query condition of the pre-calibrated user occurs in the data documents in the cleaned user screening information, and prioritizing the cleaned user screening information according to the relative frequency;
the sorting characteristic function is
[Equation image BDA0003180576470000091: not reproduced in this text]
wherein q is the query condition of the pre-calibrated user; d is a data document in the cleaned user screening information; f_i(d, q) is the relative frequency with which the i-th word of the query condition q of the pre-calibrated user occurs in the data document d; f_t(t_i, d) is the relative frequency with which the word t_i occurs in the data document d; V is the number of data documents selected according to the query condition of the pre-calibrated user; N is the number of data documents selected from the cleaned user screening information as training data documents; and N_t is the total number of data documents in the cleaned user screening information.
In one particular embodiment, the query condition of the pre-calibrated user is q = {w_1, w_2, ..., w_s}, and the data documents in the cleaned user screening information are taken as the candidate documents d = {d_1, d_2, ..., d_k}, where q is the screening condition and each d_k is a screened data document. A score is calculated for q and each d_k: S_k = score(q, d_k). f_i(d, q) is set to the relative frequency with which the i-th word of the query condition q of the pre-calibrated user occurs in the data document d, expressed as
[Equation image BDA0003180576470000101: not reproduced in this text]
wherein f_i(d, q) is the ranking feature function, understood as the weight of the i-th word of the query condition q in the candidate document d; f_t(t_i, d) is the relative frequency with which the word t_i occurs in the candidate document d; V is the number of data documents selected according to the query condition of the pre-calibrated user; N is the number of data documents selected from the cleaned user screening information as training data documents; N_t is the total number of data documents in the cleaned user screening information; and the denominator is a normalization factor. The value of f_i(d, q) is recorded as the score S_k; the higher the score, the higher the ranking.
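Because the formula itself is only available as an image, the sketch below does not reproduce the patented ranking function. It uses a plain stand-in, assuming the score of a document is the sum of the relative frequencies of the query words in that document, and it ignores V, N and N_t entirely. What it illustrates is the scoring and descending sort, S_k = score(q, d_k).

```python
from collections import Counter


def score(query_words, doc_words):
    """Sum of the relative frequencies of the query words in one document.

    Assumed stand-in for the ranking feature function; not the patent's formula.
    """
    total = len(doc_words)
    if total == 0:
        return 0.0
    counts = Counter(doc_words)
    return sum(counts[word] / total for word in query_words)


def prioritize(query_words, documents):
    """Return document numbers ordered from highest to lowest score."""
    return sorted(
        documents,
        key=lambda number: score(query_words, documents[number]),
        reverse=True,
    )
```

Here `documents` maps a document number to its list of words, so the result is a list of numbers with the best match first.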
In another embodiment, the template generation module generates a chart of the prioritized results: it obtains the screening conditions and the screened data documents, forms a mapping relation between the screening conditions and the screened data documents, and calls a chart library to generate the chart from the mapping relation.
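The mapping step could look like the sketch below. The text does not name a chart library, so a plain-text bar renderer stands in for it here; in practice the `chart_data` dictionary would be handed to whichever charting library is in use. All names are illustrative.

```python
def build_chart_data(results_by_condition):
    """Map each screening condition to the number of documents it selected."""
    return {
        condition: len(doc_numbers)
        for condition, doc_numbers in results_by_condition.items()
    }


def render_text_bars(chart_data):
    """Render the mapping as text bars (stand-in for a chart library call)."""
    width = max((len(label) for label in chart_data), default=0)
    return "\n".join(
        f"{label:<{width}} | {'#' * count} {count}"
        for label, count in chart_data.items()
    )
```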
The embodiment of the invention provides a data screening device based on big data, which is structurally characterized in that the data screening device based on big data comprises a data screening module 401, an information extraction module 402 and a priority ordering module 403 as shown in fig. 4;
the data screening module 401 is configured to obtain a screening condition, and screen the data to be screened according to the screening condition to obtain a data document corresponding to the screening condition;
the information extraction module 402 is configured to extract the data document by using an inverted index to obtain user screening information, and clean the user screening information to obtain cleaned user screening information;
the prioritization module 403 is configured to prioritize the cleaned user screening information according to a query condition of a pre-calibrated user, so as to obtain a prioritization result.
The data screening module 401 includes a user input module, a condition screening module, and an information uploading module. The user input module and the condition screening module provide a user input text box and a screening condition box, respectively: the user enters a keyword, a key character string or a key hypertext link in the input text box as the initial screening condition, and checks screening conditions in the screening condition box as the rescreening conditions, thereby producing the screening conditions. The information uploading module is configured to upload the obtained screening conditions to the cloud database;
the information extraction module 402 performs information extraction operation in a cloud database, including obtaining screening condition information uploaded by an information uploading module, performing inverted index extraction on a data document according to the screening condition information, performing boolean search on the extracted data document to obtain user screening information, and cleaning the user screening information to obtain cleaned user screening information;
the prioritization module 403 includes a prioritization module, a template generation module, and an information acquisition module, where the information acquisition module is configured to acquire data documents in the cleaned user screening information, the prioritization module is configured to prioritize the data documents in the cleaned user screening information according to a query condition of a pre-calibrated user, and the template generation module is configured to generate a chart for a prioritization result, and complete data screening.
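The way the three modules hand data to one another can be sketched as a small composition. The class name, the callable signatures and the use of plain lists of strings are assumptions; each callable is a placeholder for the corresponding module's real logic.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ScreeningDevice:
    # Each field stands in for one module of the device (401, 402, 403).
    screen: Callable[[List[str]], List[str]]             # data screening
    extract_and_clean: Callable[[List[str]], List[str]]  # information extraction
    prioritize: Callable[[List[str]], List[str]]         # prioritization

    def run(self, data: List[str]) -> List[str]:
        """Pass the data through the three modules in order."""
        documents = self.screen(data)
        cleaned_info = self.extract_and_clean(documents)
        return self.prioritize(cleaned_info)
```

Keeping the modules as injected callables means any one of them can be replaced without touching the other two.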
As shown in fig. 5, the present invention further provides an electronic device, which may be a mobile terminal, a desktop computer, a notebook computer, a palm computer, a server, or other computing devices. The electronic device includes a processor 10, a memory 20, and a display 30.
The memory 20 may in some embodiments be an internal storage unit of a computer device, such as a hard disk or memory of a computer device. The memory 20 may also be an external storage device of the computer device in other embodiments, such as a plug-in hard disk provided on the computer device, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), etc. Further, the memory 20 may also include both internal storage units and external storage devices of the computer device. The memory 20 is used for storing application software installed on the computer device and various types of data, such as program codes for installing the computer device. The memory 20 may also be used to temporarily store data that has been output or is to be output. In one embodiment, the memory 20 stores a big data based data screening method program 40, and the big data based data screening method program 40 may be executed by the processor 10, thereby implementing the big data based data screening method according to the embodiments of the present invention.
The processor 10 may in some embodiments be a central processing unit (Central Processing Unit, CPU), microprocessor or other data processing chip for running program code or processing data stored in the memory 20, e.g. performing big data based data screening methods etc.
The display 30 may in some embodiments be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like. The display 30 is used for displaying information on the computer device and for displaying a visual user interface. The components 10-30 of the computer device communicate with each other via a system bus.
In one embodiment, the following steps are implemented when the processor 10 executes the big data based data screening method program 40 in the memory 20:
acquiring screening conditions, and screening data to be screened according to the screening conditions to obtain data documents corresponding to the screening conditions;
extracting the data document by using an inverted index to obtain user screening information, and cleaning the user screening information to obtain cleaned user screening information;
and prioritizing the cleaned user screening information according to the query condition of the pre-calibrated user to obtain a prioritized result.
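The extraction step above can be illustrated with a minimal sketch. It assumes whitespace tokenization and uses a Python dict as the hash table; the function names `build_inverted_index` and `query_documents` and the sample documents are hypothetical and are not taken from the invention.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the set of document numbers that contain it."""
    index = defaultdict(set)
    # Visit each numbered document in turn and record where each word occurs.
    for doc_id, text in enumerate(documents):
        for word in text.lower().split():  # assumption: whitespace tokenization
            index[word].add(doc_id)
    return index


def query_documents(index, condition):
    """Return the numbers of the documents containing every word of the condition."""
    words = condition.lower().split()
    if not words:
        return set()
    # Look up the document numbers for each word, then take their intersection.
    postings = [index.get(word, set()) for word in words]
    return set.intersection(*postings)


documents = [
    "ship route planning data",
    "route data for unmanned ship",
    "weather data archive",
]
index = build_inverted_index(documents)
matches = query_documents(index, "ship route")
```

Here `matches` holds the numbers of the documents that contain both "ship" and "route". A set per word keeps the intersection simple; a sorted list of document numbers would serve equally well.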
The present embodiment also provides a computer-readable storage medium having stored thereon a data screening method program based on big data, which when executed by a processor, implements the steps of:
acquiring screening conditions, and screening data to be screened according to the screening conditions to obtain data documents corresponding to the screening conditions;
extracting the data document by using an inverted index to obtain user screening information, and cleaning the user screening information to obtain cleaned user screening information;
and prioritizing the cleaned user screening information according to the query condition of the pre-calibrated user to obtain a prioritized result.
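The prioritization step can be sketched as follows. The sorting characteristic function of the invention is given only as a figure and is not reproduced here; this sketch substitutes a plain sum of relative word frequencies as a stand-in, and the names `relative_frequency` and `prioritize` are hypothetical.

```python
from collections import Counter


def relative_frequency(word, tokens):
    """Share of the document's tokens that equal `word`."""
    if not tokens:
        return 0.0
    return Counter(tokens)[word] / len(tokens)


def prioritize(documents, query):
    """Order document numbers by the summed relative frequency of the query words."""
    query_words = query.lower().split()
    scored = []
    for doc_id, text in enumerate(documents):
        tokens = text.lower().split()  # assumption: whitespace tokenization
        score = sum(relative_frequency(word, tokens) for word in query_words)
        scored.append((score, doc_id))
    # Highest score first; ties fall back to the lower document number.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


documents = [
    "weather data archive",
    "route data for unmanned ship fleet",
    "ship route planning data",
]
ranking = prioritize(documents, "ship route")
```

The document whose text is most densely made up of the query words is placed first, which is the general idea of ordering by relative frequency.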
Those skilled in the art will appreciate that all or part of the methods described above may be implemented by a computer program stored on a non-transitory computer-readable storage medium; when executed, the program may carry out the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous-link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention.

Claims (8)

1. A data screening method based on big data, comprising:
acquiring screening conditions, and screening data to be screened according to the screening conditions to obtain data documents corresponding to the screening conditions;
extracting the data document by using an inverted index to obtain user screening information, and cleaning the user screening information to obtain cleaned user screening information;
acquiring, according to the query condition of the pre-calibrated user and a sorting characteristic function, the relative frequency with which the query condition of the pre-calibrated user occurs in the data documents in the cleaned user screening information, and prioritizing the cleaned user screening information according to the relative frequency to obtain a prioritization result;
the sorting characteristic function is
Figure QLYQS_1
wherein q is the query condition of the pre-calibrated user; d is a data document in the cleaned user screening information; f_i(d, q) is the relative frequency with which the i-th word of the query condition q of the pre-calibrated user occurs in the data document d; f_t(t_i, d) is the relative frequency with which the word t_i occurs in the data document d; V is the number of data documents selected according to the query condition of the pre-calibrated user; N is the number of data documents selected from the cleaned user screening information as training data documents; and N_t is the total number of data documents in the cleaned user screening information.
2. The big data-based data screening method according to claim 1, wherein screening conditions are obtained, and screening of the data to be screened is performed according to the screening conditions, specifically comprising:
and taking at least one of the characters, the character strings and the hypertext links as initial screening conditions, taking at least one of the conditions, the date, the age and the product specification information as rescreening conditions, and screening the data to be screened according to the initial screening conditions and the rescreening conditions.
3. The big data-based data screening method according to claim 1, wherein extracting the data document by using the inverted index to obtain the user screening information specifically comprises:
numbering the data documents, dividing the interior of each data document into a plurality of words, forming a corresponding relation between each word and the number of the data document by using an inverted index, and obtaining user screening information by retrieving and extracting the data documents.
4. The big data-based data screening method according to claim 3, wherein using the inverted index to form a corresponding relation between each word and the number of the data document, and obtaining the user screening information by retrieving and extracting the data documents, specifically comprises:
inverted indexing is carried out on the data document by adopting a hash table structure so as to obtain the corresponding relation from the word to all the data document numbers containing the word;
the screening conditions are disassembled into a plurality of words, and the numbers of all data documents containing the words corresponding to the screening conditions are inquired according to the corresponding relation;
and taking intersections of the numbers of all the queried data documents to obtain user screening information.
5. The method for screening data based on big data according to claim 4, wherein the inverted indexing of the data document using the hash table structure to obtain the correspondence from the word to all the data document numbers containing the word specifically comprises:
and sequentially accessing each data document, obtaining the hash-table value for each word in the data document, and inserting the data document number into that value, so as to determine the corresponding relation from each word to the numbers of all data documents containing the word.
6. A data screening device based on big data, comprising a data screening module, an information extraction module and a prioritization module;
the data screening module is used for acquiring screening conditions, screening data to be screened according to the screening conditions and acquiring data documents corresponding to the screening conditions;
the information extraction module is used for extracting the data document by using the inverted index to obtain user screening information, and cleaning the user screening information to obtain cleaned user screening information;
the prioritization module is used for acquiring, according to the query condition of the pre-calibrated user and a sorting characteristic function, the relative frequency with which the query condition of the pre-calibrated user occurs in the data documents in the cleaned user screening information, and prioritizing the cleaned user screening information according to the relative frequency to obtain a prioritization result;
the sorting characteristic function is
Figure QLYQS_2
wherein q is the query condition of the pre-calibrated user; d is a data document in the cleaned user screening information; f_i(d, q) is the relative frequency with which the i-th word of the query condition q of the pre-calibrated user occurs in the data document d; f_t(t_i, d) is the relative frequency with which the word t_i occurs in the data document d; V is the number of data documents selected according to the query condition of the pre-calibrated user; N is the number of data documents selected from the cleaned user screening information as training data documents; and N_t is the total number of data documents in the cleaned user screening information.
7. An electronic device comprising a processor and a memory, wherein the memory has stored thereon a computer program which, when executed by the processor, implements the big data based data screening method according to any of claims 1-5.
8. A computer readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the big data based data screening method according to any of claims 1-5.
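The two-stage screening recited in claim 2 can be pictured with a minimal sketch. It assumes a character string as the initial screening condition and a date range as the rescreening condition; the `Record` layout, its field names, and the helper functions are hypothetical and are not part of the claims.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Record:
    """Hypothetical record layout; the field names are assumptions."""
    text: str
    created: date


def initial_screen(records, keyword):
    """Initial screening: keep records whose text contains the character string."""
    return [r for r in records if keyword in r.text]


def rescreen(records, start, end):
    """Rescreening: keep records dated within the closed range [start, end]."""
    return [r for r in records if start <= r.created <= end]


records = [
    Record("ship route report", date(2021, 3, 1)),
    Record("ship maintenance log", date(2020, 6, 15)),
    Record("weather summary", date(2021, 4, 2)),
]
first_pass = initial_screen(records, "ship")
result = rescreen(first_pass, date(2021, 1, 1), date(2021, 12, 31))
```

Applying the coarse string condition first narrows the data before the rescreening condition is checked; other rescreening conditions, such as age or product specification, would follow the same pattern.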
CN202110845992.8A 2021-07-26 2021-07-26 Data screening method and device based on big data and electronic equipment Active CN113641815B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110845992.8A CN113641815B (en) 2021-07-26 2021-07-26 Data screening method and device based on big data and electronic equipment

Publications (2)

Publication Number Publication Date
CN113641815A CN113641815A (en) 2021-11-12
CN113641815B true CN113641815B (en) 2023-06-13

Family

ID=78418374

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110845992.8A Active CN113641815B (en) 2021-07-26 2021-07-26 Data screening method and device based on big data and electronic equipment

Country Status (1)

Country Link
CN (1) CN113641815B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107341221A (en) * 2017-06-28 2017-11-10 百度在线网络技术(北京)有限公司 Foundation, associative search method, apparatus, equipment and the storage medium of index structure
CN112540986A (en) * 2020-12-07 2021-03-23 吴娟 Dynamic indexing method and system for quick combined query of big electric power data

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104361042B (en) * 2014-10-29 2019-02-12 中国建设银行股份有限公司 A kind of information retrieval method and device
CN107491518B (en) * 2017-08-15 2020-08-04 北京百度网讯科技有限公司 Search recall method and device, server and storage medium
CN108680163B (en) * 2018-04-25 2022-03-01 武汉理工大学 Unmanned ship path searching system and method based on topological map
EP3906645A4 (en) * 2019-01-04 2022-06-01 Proofpoint, Inc. System and method for scalable file filtering using wildcards
CN110377558B (en) * 2019-06-14 2023-06-20 平安科技(深圳)有限公司 Document query method, device, computer equipment and storage medium
CN110727663A (en) * 2019-09-09 2020-01-24 光通天下网络科技股份有限公司 Data cleaning method, device, equipment and medium
CN111522905A (en) * 2020-04-15 2020-08-11 武汉灯塔之光科技有限公司 Document searching method and device based on database

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
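Research and Application of a Feature Selection Method Based on Document-Level Word Frequency Re-ranking (基于文档层词频重排序的特征选择方法的研究与应用); Zhang Yingjie (张英杰); CNKI (知网); pp. 1-58 *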

Also Published As

Publication number Publication date
CN113641815A (en) 2021-11-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant