CN107704620B - Archive management method, device, equipment and storage medium

Info

Publication number: CN107704620B
Application number: CN201711021845.9A
Authority: CN
Inventors: 张立志; 万月亮; 王梅
Original assignee: Beijing Ruian Technology Co Ltd
Current assignee: Beijing Ruian Technology Co Ltd
Priority date: 2017-10-27
Filing date: 2017-10-27
Publication date: 2020-12-01
Anticipated expiration: 2037-10-27
Also published as: CN107704620A

Abstract

The invention discloses a method, a device, equipment and a storage medium for archive management. The method comprises: acquiring archive original data, wherein each piece of archive original data contains two pieces of attribute information; after the archive original data are processed, screening the archive original data according to business requirements and forming archive graphic data as an archive; receiving input attribute information for a query; and searching the archive for information corresponding to the input attribute information to generate a query result. The technical scheme of the embodiments of the invention solves two problems of the prior art, namely that acquisition of all attribute information cannot be guaranteed and that the query time is long because multiple queries are required. It thereby achieves effective management of personnel archive information and ensures that all attribute information can be acquired quickly when the personnel archive information is queried.

Description

Archive management method, device, equipment and storage medium

Technical Field

The embodiment of the invention relates to big data analysis and graph calculation technologies, in particular to a method, a device, equipment and a storage medium for archive management.

Background

Modern society is developing rapidly: technology is advanced, information circulates freely, and people communicate more closely and live more conveniently. Big data is a product of this high-tech age, and such data contain a large amount of information. A personnel archive is usually created to meet business requirements, and personnel archive information is queried and analyzed in the created archive. A personnel archive contains a large amount of related attribute information. To query personnel archive information effectively, this massive data needs to be managed professionally.

Fig. 1 is a schematic diagram of personnel archive data in the prior art. In the prior art, the attribute information in a personnel archive is stored discretely: the original associations between pieces of attribute information are retained, and no additional association is added. As shown in fig. 1, the attribute information of a person includes attribute information A, attribute information B, attribute information C, attribute information D, attribute information E, attribute information F, attribute information G, and attribute information H. Attribute information A is associated with attribute information B and attribute information C; attribute information B is associated with attribute information D and attribute information E; attribute information C is associated with attribute information F; attribute information E is associated with attribute information G; and attribute information G is associated with attribute information H.

In a prior-art personnel archive, obtaining all attribute information of a person as shown in fig. 1 requires several successive queries: attribute information B and attribute information C are queried through attribute information A; attribute information D, attribute information E and attribute information F are queried through attribute information B and attribute information C; attribute information G is queried through attribute information E; and attribute information H is queried through attribute information G. When the data nodes of the personnel archive are uncertain and the data volume is very large, this query mode cannot guarantee that all attribute information of a person is obtained. In addition, the query time is long because multiple queries are needed.
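The multi-round lookup described above can be illustrated with a minimal Python sketch. The association table mirrors fig. 1; the function name `multi_round_lookup` and the dictionary layout are assumptions made purely for illustration and are not taken from any prior-art system.

```python
# Associations from fig. 1: querying a key returns its associated attributes.
FIG1_LINKS = {
    "A": {"B", "C"},
    "B": {"D", "E"},
    "C": {"F"},
    "E": {"G"},
    "G": {"H"},
}


def multi_round_lookup(links, start):
    """Collect attributes round by round, as in the prior-art query mode.

    Returns the set of attributes found and the number of query rounds
    that returned at least one new attribute.
    """
    known = {start}
    frontier = {start}
    rounds = 0
    while frontier:
        found = set()
        for attribute in frontier:
            found |= links.get(attribute, set())
        new = found - known
        if new:
            rounds += 1
        known |= new
        frontier = new  # only newly found attributes are queried next
    return known, rounds


attributes, rounds = multi_round_lookup(FIG1_LINKS, "A")
```

For the data of fig. 1, starting from attribute information A takes four such rounds before attribute information H is reached, which is the repeated querying the invention aims to avoid.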

Disclosure of Invention

In view of this, the present invention provides a method, a device, equipment and a storage medium for archive management, so as to achieve effective management of personnel archive information and ensure that all attribute information can be quickly acquired when querying the personnel archive information.

In a first aspect, an embodiment of the present invention provides a method for archive management, including:

acquiring archive original data, wherein each piece of archive original data contains two pieces of attribute information;

after the archive original data are processed, screening the archive original data according to business requirements and forming archive graphic data as an archive;

receiving input attribute information for querying;

and searching information corresponding to the input attribute information for query in the archive to generate a query result.

Further, processing the archive original data, screening the archive original data according to business requirements and forming archive graphic data as an archive includes:

removing duplicate data from the archive original data;

selecting a bridge point according to service requirements, wherein the bridge point is an attribute information type with high service value;

screening the original archive data according to the bridge points to screen out single-bridge data and double-bridge data;

and forming archive graphic data as an archive according to the bridge points, the single bridge data and the double bridge data.

Furthermore, the single-bridge data is archive original data in which only one of the two pieces of attribute information is a bridge point;

the double-bridge data is archive original data in which both pieces of attribute information are bridge points.

Further, forming the archive graphic data as an archive according to the bridge points, the single bridge data and the double bridge data includes:

extracting all attribute information belonging to the bridge points as vertexes of the archive graphic data;

and forming edges of the archive graphic data according to the single bridge data and the double bridge data, and connecting the attribute information which is contained in the single bridge data and does not belong to the bridge point to the corresponding vertex.

Further, the searching for the information corresponding to the input attribute information for query in the archive to generate a query result includes:

acquiring archive graphic data corresponding to the input attribute information for query according to the input attribute information for query;

and generating a query result according to the attribute information in the archive graphic data.

In a second aspect, an embodiment of the present invention further provides an apparatus for archive management, including:

the data acquisition module is used for acquiring archive original data, wherein each piece of archive original data comprises two pieces of attribute information;

the data processing module is used for screening the archive original data according to the service requirement after processing the archive original data, and forming archive graphic data as an archive;

the query information input module is used for receiving input attribute information for query;

and the query result generation module is used for searching the archive for information corresponding to the input attribute information for query and generating a query result.

Further, the data processing module comprises:

the data deduplication unit is used for removing duplicate data from the archive original data;

the bridge point selection unit is used for selecting bridge points according to service requirements, and the bridge points are attribute information types with high service values;

the data screening unit is used for screening the original archive data according to the bridge points to screen out single-bridge data and double-bridge data;

and the graph data generation unit is used for forming archive graph data as an archive according to the bridge point, the single bridge data and the double bridge data.

In a third aspect, an embodiment of the present invention further provides equipment for archive management, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the archive management method according to the embodiments of the present invention.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the archive management method according to the embodiments of the present invention.

According to the method, device, equipment and storage medium for archive management provided by the embodiments of the invention, archive original data are acquired; after the archive original data are processed, they are screened according to business requirements and archive graphic data are formed as an archive; input attribute information for a query is received; and information corresponding to the input attribute information is searched for in the archive to generate a query result. This solves the problems that the prior art cannot guarantee that all attribute information is obtained and that the query time is long because multiple queries are needed. It achieves effective management of personnel archive information and ensures that all attribute information can be obtained quickly when the personnel archive information is queried.

Drawings

FIG. 1 is a diagram illustrating a person profile data in the prior art;

FIG. 2 is a flowchart illustrating a file management method according to an embodiment of the present invention;

FIG. 3 is a flowchart illustrating a file management method according to a second embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a file management apparatus according to a third embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a file management apparatus according to a fourth embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.

Example one

Fig. 2 is a flowchart of an archive management method according to a first embodiment of the present invention. The method is applicable to managing archives and can be executed by an archive management device, which is implemented in software and/or hardware and can generally be integrated into a device for data synchronization. Devices for data synchronization include, but are not limited to, computers. Referring to fig. 2, the method specifically includes the following steps:

Step 110, acquiring archive original data, wherein each piece of archive original data contains two pieces of attribute information.

The archive original data are associated data used for generating the archive. The archive original data may come from many sources, such as social networks, e-commerce sites, or customer visit records, and are acquired from these sources by an archive original data acquisition method. Such acquisition methods include system log collection, network data collection, and collection through specific system interfaces or similar means arranged in cooperation with enterprises or research institutions. Because the volume of archive original data is large, the massive archive original data collected according to business requirements are usually entered into a big data platform for storage and processing. A big data platform is a platform for storing, computing on and displaying big data; it integrates functions such as data integration, data processing, data storage, data analysis and visualization, and is used to mine the business logic behind the data, find the problems behind the data, and make timely adjustments.

Specifically, in this embodiment, the distributed computing platform Hadoop is used as the big data platform. The archive original data are entered into the Hadoop Distributed File System (HDFS). The compute engine Spark, which can be seamlessly integrated into the Hadoop platform, reads the archive original data stored in HDFS into a Resilient Distributed Dataset (RDD) with its file read instruction textFile, and the archive original data are then processed by Spark.
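The patent does not specify the on-disk layout of a piece of archive original data. As a minimal sketch, assuming each text line holds two tab-separated `type=value` fields (a hypothetical format), a record with its two pieces of attribute information could be parsed as below. Plain Python is used so the example stays self-contained; in Spark, a comparable parse function could be applied to each line of the RDD returned by `textFile`. The names `Attribute` and `parse_record` are illustrative only.

```python
from typing import NamedTuple, Tuple


class Attribute(NamedTuple):
    """One piece of attribute information (hypothetical layout)."""
    kind: str   # attribute information type, e.g. "phone"
    value: str  # concrete value of that type


Record = Tuple[Attribute, Attribute]


def parse_attribute(field: str) -> Attribute:
    """Parse a single 'type=value' field."""
    kind, _, value = field.partition("=")
    if not kind or not value:
        raise ValueError(f"malformed attribute field: {field!r}")
    return Attribute(kind, value)


def parse_record(line: str) -> Record:
    """Parse one line of archive original data into its two attributes.

    Raises ValueError when the line does not hold exactly two fields.
    """
    left, right = line.rstrip("\n").split("\t")
    return parse_attribute(left), parse_attribute(right)


record = parse_record("phone=13800000000\temail=a@example.com")
```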

The distributed computing platform Hadoop is a distributed computing platform which can be easily constructed and used by users. The data are stored in the HDFS of the Hadoop distributed computing platform, and a user can easily develop and run an application program for processing mass data on the Hadoop distributed computing platform. The distributed computing platform Hadoop has the advantages of high reliability, high efficiency, high fault tolerance, low cost and the like.

The compute engine Spark is a fast, general-purpose compute engine designed for large-scale data processing. It can be seamlessly integrated into the distributed computing platform Hadoop and completes various operations by operating on RDDs, including Structured Query Language (SQL) queries, text processing, graph computing, and machine learning. Spark is efficient, easy to use, and general-purpose.

Step 120, after the archive original data are processed, screening the archive original data according to business requirements and forming archive graphic data as an archive.

In actual business, the archive original data often contain duplicate data, so deduplication is one step in processing the archive original data. Specifically, the duplicate data in the archive original data are removed by using the deduplication instruction distinct of the RDD in the compute engine Spark.
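The effect of the deduplication step can be sketched in plain Python. This is only an illustration of exact-duplicate removal on a small in-memory list, not the Spark implementation; the function name `deduplicate` and the tuple layout of a record are assumptions. Note that the sketch treats a record and its reversed form as different records, because only exact duplicates are removed.

```python
def deduplicate(records):
    """Return the records with exact duplicates removed, keeping first-seen order."""
    seen = set()
    unique = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique


raw = [
    (("phone", "138"), ("email", "a@example.com")),
    (("phone", "138"), ("email", "a@example.com")),  # exact duplicate
    (("email", "a@example.com"), ("phone", "138")),  # reversed order, kept
]
unique_records = deduplicate(raw)
```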

The data volume of the original archive data is huge, the original archive data is screened according to the business requirements, the original archive data with high business value is reserved, the original archive data with low business value is filtered, and the operation efficiency is improved on the premise of meeting the business requirements. Specifically, after removing the repeated data in the original archive data, selecting the attribute information type with high business value, and screening the original archive data. If at least one attribute information in the two attribute information contained in one file original data belongs to the attribute information type with high service value, performing the next data processing on the file original data; and if none of the two attribute information contained in one file original data belongs to the attribute information type with high service value, the next data processing is not carried out on the file original data.

Using graph computation, the discrete archive original data are associated according to their relevance to form corresponding archive graphic data, that is, graph-structured data, as an archive. Graph computation is based on graph theory: it abstracts the real world into a "graph" structure and defines a mode of computation on that data structure. In general, the basic data structure in graph computation is expressed as:

G = (V, E, D)

V = vertex (vertex or node)

E = edge

D = data (weight).

The data structure of the graph computation is composed of vertices and edges. The vertex contains a vertex attribute. Edges contain weights and directions, i.e., edges contain data associations between vertices. The edges connect the associated vertices according to the relevance of the data to form a data structure of the graph calculation. The graph data structure expresses the relevance between data well. Relevance computation is the core of big data computation. By obtaining the relevance of the data, useful information can be extracted from the mass data.
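The structure G = (V, E, D) can be sketched as a small Python class in which vertices carry attributes and directed edges carry a weight. The class and method names are hypothetical and chosen only to make the three components concrete; they do not correspond to any particular graph library.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Illustrative G = (V, E, D): vertices, directed edges, edge data."""
    vertices: dict = field(default_factory=dict)  # V: vertex id -> vertex attributes
    edges: dict = field(default_factory=dict)     # E and D: (source, target) -> weight

    def add_vertex(self, vertex_id, **attributes):
        """Create the vertex if needed and merge in its attributes."""
        self.vertices.setdefault(vertex_id, {}).update(attributes)

    def add_edge(self, source, target, weight=1.0):
        """Add a directed, weighted edge, creating missing endpoints."""
        self.add_vertex(source)
        self.add_vertex(target)
        self.edges[(source, target)] = weight

    def neighbors(self, vertex_id):
        """Vertices reachable from vertex_id over one outgoing edge."""
        return {target for (source, target) in self.edges if source == vertex_id}


graph = Graph()
graph.add_vertex("A", kind="phone")
graph.add_edge("A", "B", weight=2.0)
```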

Specifically, the compute engine Spark extracts the attribute information of the screened archive original data as vertices and connects the vertices according to the association relation of the attribute information, forming archive graphic data with a graph computation data structure as an archive. The archive graphic data are saved through the save instruction of the RDD in the compute engine Spark. Because the vertices of the archive graphic data are connected into a whole by edges and each vertex contains its corresponding attribute information, the attribute information in the archive graphic data is connected into a whole.

Step 130, receiving the input attribute information for querying.

The archive graphic data in the archive all contain corresponding attribute information. When the archive is queried, attribute information for the query is input according to the query requirement, and all attribute information in the archive related to the input attribute information can then be queried.

Step 140, searching the information corresponding to the input attribute information for query in the archive, and generating a query result.

The attribute information in the archive graphic data is connected into a whole, and all the attribute information related to the input attribute information for query can be acquired by acquiring the archive graphic data corresponding to the input attribute information for query through one-time query. And searching archive graphic data corresponding to the input attribute information for query in the archive according to the input attribute information for query to generate a query result. The query result contains all attribute information in the acquired archive graphic data.
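One way to picture the one-time query is a traversal that returns every attribute connected to the queried one. The sketch below uses a breadth-first search over an undirected adjacency map built from the associations of fig. 1; the function names and the use of breadth-first search are assumptions for illustration, since the patent does not name a traversal algorithm.

```python
from collections import defaultdict, deque


def build_adjacency(pairs):
    """Build an undirected adjacency map from two-attribute records."""
    adjacency = defaultdict(set)
    for first, second in pairs:
        adjacency[first].add(second)
        adjacency[second].add(first)
    return adjacency


def query(adjacency, attribute):
    """Return all attributes connected to the queried attribute."""
    if attribute not in adjacency:
        return set()
    seen = {attribute}
    queue = deque([attribute])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


fig1_pairs = [("A", "B"), ("A", "C"), ("B", "D"), ("B", "E"),
              ("C", "F"), ("E", "G"), ("G", "H"), ("X", "Y")]
archive = build_adjacency(fig1_pairs)
result = query(archive, "E")
```

Because the graph is treated as undirected, a single call starting from any attribute, here E, yields the whole connected group A to H, while the unrelated pair X and Y is left out.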

In the archive management method provided by this embodiment, archive original data are acquired; after the archive original data are processed, they are screened according to business requirements and archive graphic data are formed as an archive; input attribute information for a query is received; and information corresponding to the input attribute information is searched for in the archive to generate a query result. This solves the problems that the prior art cannot guarantee that all attribute information is obtained and that the query time is long because multiple queries are needed. It achieves effective management of personnel archive information and ensures that all attribute information can be obtained quickly when the personnel archive information is queried.

Example two

Fig. 3 is a flowchart of a file management method according to a second embodiment of the present invention, which is embodied on the basis of the foregoing embodiments. As shown in fig. 3, the method specifically includes:

Step 210, acquiring archive original data, wherein each piece of archive original data contains two pieces of attribute information.

Step 220, removing duplicate data from the archive original data.

The archive original data are processed by removing the duplicate data in them with the deduplication instruction distinct of the RDD in the compute engine Spark.

Step 230, selecting a bridge point according to the service requirement, wherein the bridge point is an attribute information type with high service value.

In the deduplicated archive original data, the attribute information type with high service value is selected as a bridge point. Attribute information that does not belong to a bridge point is a non-bridge point.

Step 240, screening the archive original data according to the bridge points to screen out single-bridge data and double-bridge data.

The single-bridge data is archive original data in which only one of the two pieces of attribute information is a bridge point; the double-bridge data is archive original data in which both pieces of attribute information are bridge points.

The archive original data are screened according to the selected bridge points, and the single-bridge data and double-bridge data screened out proceed to the next data processing step. If only one of the two pieces of attribute information contained in a piece of archive original data is a bridge point, that piece is single-bridge data; if both are bridge points, that piece is double-bridge data; if neither is a bridge point, that is, both pieces of attribute information are non-bridge points, that piece is not processed further. In this way, archive original data with high business value are retained, archive original data with low business value are filtered out, and operating efficiency is improved while the business requirements are still met.
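The screening rule above can be sketched as a small classification function. It assumes, for illustration only, that each piece of attribute information is a `(type, value)` tuple and that the bridge points are given as a set of attribute types; the names `classify` and `bridge_kinds` are hypothetical.

```python
def classify(records, bridge_kinds):
    """Split records into single-bridge and double-bridge data.

    Records whose two attributes are both non-bridge points are dropped.
    """
    single_bridge, double_bridge = [], []
    for record in records:
        hits = sum(1 for kind, _value in record if kind in bridge_kinds)
        if hits == 2:
            double_bridge.append(record)
        elif hits == 1:
            single_bridge.append(record)
    return single_bridge, double_bridge


bridge_kinds = {"id_card", "phone"}  # hypothetical high-value attribute types
records = [
    (("id_card", "110"), ("phone", "138")),           # both are bridge points
    (("phone", "138"), ("nickname", "tom")),          # one bridge point
    (("nickname", "tom"), ("hobby", "chess")),        # no bridge point
]
single_bridge, double_bridge = classify(records, bridge_kinds)
```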

Step 250, forming archive graphic data as an archive according to the bridge points, the single-bridge data and the double-bridge data.

The attribute information in the single-bridge data and the double-bridge data is connected into a whole according to the association relation of that attribute information, forming archive graphic data as an archive.

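Preferably, forming the archive graphic data as an archive according to the bridge points, the single bridge data and the double bridge data includes:

extracting all attribute information belonging to the bridge points as vertexes of the archive graphic data;

and forming edges of the archive graphic data according to the single bridge data and the double bridge data, and connecting the attribute information which is contained in the single bridge data and does not belong to the bridge point to the corresponding vertex.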

Specifically, all attribute information belonging to a bridge point in the single-bridge data and the double-bridge data is extracted as vertices of the archive graphic data. Edges of the archive graphic data are then formed from the single-bridge data and the double-bridge data. Because both pieces of attribute information in double-bridge data are bridge points, each piece of double-bridge data contains two vertices; forming an edge according to the association relation of its attribute information therefore connects a vertex to a vertex. Because only one of the two pieces of attribute information in single-bridge data is a bridge point, each piece of single-bridge data contains one vertex and one non-bridge point; forming an edge according to the association relation of its attribute information therefore connects the attribute information that does not belong to a bridge point to the corresponding vertex.
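The construction just described can be sketched as follows. The sketch is one possible reading, not the patented implementation: bridge-point attributes become vertices, double-bridge data produce vertex-to-vertex edges, and the non-bridge attribute of each piece of single-bridge data is stored as an attachment of its vertex. The function name and the `(type, value)` tuple layout are assumptions.

```python
def build_archive_graph(single_bridge, double_bridge, bridge_kinds):
    """Build vertices (with attached non-bridge attributes) and edges."""
    vertices = {}  # bridge attribute -> set of attached non-bridge attributes
    edges = set()  # each edge is a frozenset of two bridge attributes

    for first, second in double_bridge:
        vertices.setdefault(first, set())
        vertices.setdefault(second, set())
        if first != second:
            edges.add(frozenset((first, second)))

    for first, second in single_bridge:
        # Exactly one of the two attributes is a bridge point.
        if first[0] in bridge_kinds:
            bridge, other = first, second
        else:
            bridge, other = second, first
        vertices.setdefault(bridge, set()).add(other)

    return vertices, edges


bridge_kinds = {"id_card", "phone"}  # hypothetical bridge points
single = [(("nickname", "tom"), ("phone", "138"))]
double = [(("id_card", "110"), ("phone", "138"))]
vertices, edges = build_archive_graph(single, double, bridge_kinds)
```

Storing non-bridge attributes as attachments keeps them as leaves of the graph, so only bridge points take part in vertex-to-vertex connections.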

Step 260, receiving the input attribute information for querying.

Specifically, a web page for inputting attribute information is provided so that the user can input the attribute information for the query, which improves the user experience. The user inputs certain attribute information and confirms that the query should start, whereupon the input attribute information for the query is received.

Step 270, acquiring archive graphic data corresponding to the input attribute information for query according to the input attribute information for query.

The attribute information in the archive graphic data is connected into a whole, and all the attribute information related to the input attribute information for query can be acquired by acquiring the archive graphic data corresponding to the input attribute information for query through one-time query.
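If the connected groups are computed once when the archive is built, a query can reduce to a single dictionary lookup. The sketch below does this with a union-find (disjoint-set) structure; this technique is swapped in here as an illustration and is not named in the patent, and all identifiers are hypothetical.

```python
class ComponentIndex:
    """Minimal union-find used to group connected attributes."""

    def __init__(self):
        self.parent = {}

    def find(self, item):
        """Return the representative of item's group, compressing the path."""
        self.parent.setdefault(item, item)
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[item] != root:
            self.parent[item], item = root, self.parent[item]
        return root

    def union(self, first, second):
        """Merge the groups that contain first and second."""
        root_first, root_second = self.find(first), self.find(second)
        if root_first != root_second:
            self.parent[root_first] = root_second


def build_lookup(pairs):
    """Map every attribute to the full set of attributes in its group."""
    index = ComponentIndex()
    for first, second in pairs:
        index.union(first, second)
    groups = {}
    for item in list(index.parent):
        groups.setdefault(index.find(item), set()).add(item)
    return {item: members for members in groups.values() for item in members}


lookup = build_lookup([("A", "B"), ("B", "C"), ("X", "Y")])
```

With such a table, `lookup.get(attribute)` returns the whole group in one step, at the cost of rebuilding or updating the table whenever new archive original data arrive.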

Step 280, generating a query result according to the attribute information in the archive graphic data.

The query result comprises all attribute information in the acquired archive graphic data. Specifically, the query result is displayed on a web page for the user to view.

In the archive management method provided by this embodiment, archive original data are acquired; duplicate data in the archive original data are removed; bridge points are selected according to business requirements; the archive original data are screened according to the bridge points to obtain single-bridge data and double-bridge data; archive graphic data are formed as an archive according to the bridge points, the single-bridge data and the double-bridge data; input attribute information for a query is received; the archive graphic data corresponding to the input attribute information are acquired; and a query result is generated. This solves the problems that the prior art cannot guarantee that all attribute information is obtained and that the query time is long because multiple queries are needed. It achieves effective management of personnel archive information and ensures that all attribute information can be obtained quickly when the personnel archive information is queried.

Example three

Fig. 4 is a schematic structural diagram of a file management apparatus according to a third embodiment of the present invention, which is applicable to a case of synchronizing data, and as shown in fig. 4, the apparatus includes:

a data acquisition module 310, a data processing module 320, a query information input module 330, and a query result generation module 340.

The data obtaining module 310 is configured to obtain archive original data, where each archive original data includes two pieces of attribute information; the data processing module 320 is configured to, after processing the archive original data, screen the archive original data according to a service requirement, form archive graphic data, and generate an archive; a query information input module 330 for receiving input attribute information for a query; and the query result generating module 340 is configured to search the archive for information corresponding to the input attribute information for querying, and generate a query result.
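The division into four modules can be pictured as one class with one method per module. This is a hypothetical software stand-in for modules 310 to 340, written under simplifying assumptions: attributes are `(type, value)` tuples, every attribute of a retained record is treated as a graph node, and the query uses a breadth-first traversal. None of the names below come from the patent.

```python
from collections import defaultdict, deque


class ArchiveManager:
    """Illustrative stand-in for modules 310 to 340."""

    def __init__(self, bridge_kinds):
        self.bridge_kinds = set(bridge_kinds)
        self.raw_records = []
        self.adjacency = defaultdict(set)

    def acquire(self, records):
        """Data acquisition module 310: collect two-attribute records."""
        self.raw_records.extend(records)

    def process(self):
        """Data processing module 320: deduplicate, screen, build the graph."""
        self.adjacency.clear()
        for first, second in dict.fromkeys(self.raw_records):
            if first[0] in self.bridge_kinds or second[0] in self.bridge_kinds:
                self.adjacency[first].add(second)
                self.adjacency[second].add(first)

    def receive_query(self, kind, value):
        """Query information input module 330: normalize the input."""
        return (kind.strip(), value.strip())

    def generate_result(self, attribute):
        """Query result generation module 340: all connected attributes."""
        if attribute not in self.adjacency:
            return []
        seen, queue = {attribute}, deque([attribute])
        while queue:
            for neighbor in self.adjacency[queue.popleft()]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        return sorted(seen)


manager = ArchiveManager({"phone"})
manager.acquire([
    (("phone", "138"), ("email", "a@example.com")),
    (("phone", "138"), ("email", "a@example.com")),
    (("nickname", "tom"), ("email", "b@example.com")),
])
manager.process()
result = manager.generate_result(manager.receive_query(" phone ", "138"))
```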

The archive management device provided by this embodiment acquires archive original data; after the archive original data are processed, screens them according to business requirements and forms archive graphic data as an archive; receives input attribute information for a query; and searches the archive for information corresponding to the input attribute information to generate a query result. This solves the problems that the prior art cannot guarantee that all attribute information is obtained and that the query time is long because multiple queries are needed. It achieves effective management of personnel archive information and ensures that all attribute information can be obtained quickly when the personnel archive information is queried.

On the basis of the foregoing embodiments, the data processing module 320 may include:

the bridge point selection unit is used for selecting a bridge point according to service requirements, wherein the bridge point is an attribute information type with high service value;

and the graph data generation unit is used for forming archive graph data according to the bridge points, the single bridge data and the double bridge data to generate an archive.

On the basis of the above embodiments, the single-bridge data may be archive original data in which only one of the two pieces of attribute information is a bridge point;

the double-bridge data may be archive original data in which both pieces of attribute information are bridge points.

On the basis of the foregoing embodiments, the graphics data generating unit may include:

the vertex generation subunit is used for extracting all the attribute information belonging to the bridge points as the vertexes of the archive graphic data;

and the graph connection subunit is used for forming the edges of the archive graph data according to the single-bridge data and the double-bridge data and connecting the attribute information which is contained in the single-bridge data and does not belong to the bridge point to the corresponding vertex.

On the basis of the foregoing embodiments, the query result generating module 340 may include:

the data acquisition subunit is used for acquiring archive graphic data corresponding to the input attribute information for inquiry according to the input attribute information for inquiry;

and the result generating subunit is used for generating a query result according to the attribute information in the archive graphic data.

The device can execute the method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.

Example four

Fig. 5 is a schematic structural diagram of an archive management apparatus according to a fourth embodiment of the present invention. As shown in fig. 5, the device includes a processor 410, a memory 420, an input device 430, and an output device 440. The number of processors 410 in the device may be one or more; one processor 410 is taken as an example in fig. 5. The processor 410, the memory 420, the input device 430, and the output device 440 in the device may be connected by a bus or by other means; connection by a bus is taken as an example in fig. 5.

The memory 420 serves as a computer-readable storage medium for storing software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the archive management method in the embodiments of the present invention (e.g., the data acquisition module 310, the data processing module 320, the query information input module 330, and the query result generation module 340 in the apparatus for archive management). The processor 410 executes the various functional applications and data processing of the device by running the software programs, instructions and modules stored in the memory 420, thereby implementing the archive management method described above.

The memory 420 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created according to the use of the terminal, and the like. Further, the memory 420 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 420 may further include memory located remotely from the processor 410, which may be connected to the device through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input device 430 may be used to receive archive original data input from outside the device. The output device 440 may be used to output and display the query results.

The archive management device can execute the method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to the executed method.

EXAMPLE five

An embodiment of the present invention further provides a storage medium containing computer-executable instructions, which when executed by a computer processor, perform a method of archive management, the method including:

receiving input attribute information for querying;

Of course, in the storage medium containing computer-executable instructions provided by the embodiment of the present invention, the computer-executable instructions are not limited to the method operations described above, and may also perform related operations in the archive management method provided by any embodiment of the present invention.

From the above description of the embodiments, it will be clear to those skilled in the art that the present invention can be implemented by software together with the necessary general-purpose hardware, and can certainly also be implemented by hardware alone, although the former is the better implementation in many cases. Based on this understanding, the technical solutions of the present invention may be embodied in the form of a software product. The software product can be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a flash memory (FLASH), a hard disk, or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.

It should be noted that, in the embodiment of the archive management apparatus, the included units and modules are divided only according to functional logic, and the division is not limited to the one described above as long as the corresponding functions can be implemented. In addition, the specific names of the functional units serve only to distinguish them from one another and are not used to limit the protection scope of the present invention.

It is to be noted that the foregoing describes only the preferred embodiments of the present invention and the technical principles employed. Those skilled in the art will understand that the present invention is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements, and substitutions can be made without departing from the scope of the invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to them and may include other equivalent embodiments without departing from the spirit of the present invention; the scope of the present invention is determined by the scope of the appended claims.

Claims

1. A method of archive management, comprising:

receiving input attribute information for querying;

searching an archive for information corresponding to the input attribute information for querying, to generate a query result;

after the archive original data are processed, screening the archive original data according to business requirements and forming archive graphic data to generate an archive, comprising the following steps of:

removing repeated data from the archive original data;

screening the archive original data according to the bridge points to obtain single-bridge data and double-bridge data, wherein the single-bridge data is archive original data in which only one of the two pieces of attribute information is a bridge point, and the double-bridge data is archive original data in which both pieces of attribute information are bridge points;

2. The method of claim 1, wherein forming the archive graphical data from the bridge points, the single bridge data, and the double bridge data, generating the archive comprises:

3. The method of claim 1, wherein the searching for information in the archive corresponding to the input attribute information for querying, and generating a query result comprises:

4. An archive management apparatus, comprising:

the query result generation module is used for searching the archive for the information corresponding to the input attribute information for querying, and generating a query result;

the data processing module comprises:

the data screening unit is used for screening the archive original data according to the bridge points to obtain single-bridge data and double-bridge data, wherein the single-bridge data is archive original data in which only one of the two pieces of attribute information is a bridge point, and the double-bridge data is archive original data in which both pieces of attribute information are bridge points;

5. An archive management device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the archive management method according to any of claims 1-3.

6. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of archive management according to any of claims 1-3.