CN110222249B

CN110222249B - Data storage method, device, computer equipment and storage medium

Info

Publication number: CN110222249B
Application number: CN201910370895.0A
Authority: CN
Inventors: 吴壮伟
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2019-05-06
Filing date: 2019-05-06
Publication date: 2024-05-31
Anticipated expiration: 2039-05-06
Also published as: CN110222249A

Abstract

The invention discloses a data storage method, a data storage device, computer equipment and a storage medium. The method comprises the following steps: crawling a corresponding source data set according to a received enterprise name, and preprocessing the source data set to obtain processed data corresponding one by one to the plurality of text data it contains; de-duplicating the processed data to obtain a corresponding de-duplicated data set; obtaining the enterprise history information lists included in the de-duplicated data set, and adding each enterprise information list to the corresponding time node of a time axis or calendar according to the occurrence time of the enterprise event in the included text and the corresponding enterprise history information list, so as to obtain development track data corresponding to the enterprise name; and displaying the development track data through the time axis or calendar. The method uses data crawling to collect enterprise-related information from the network automatically, extracts the key information from it, and sorts that key information by time to obtain the development track data of the enterprise.

Description

Data storage method, device, computer equipment and storage medium

Technical Field

The present invention relates to the field of data acquisition technologies, and in particular, to a data storage method, a data storage device, a computer device, and a storage medium.

Background

At present, information about major enterprise events is scattered along the time dimension and is difficult to collect and organize centrally. In addition, enterprises generally compile their development records by manual text processing, and tools for automatic collection are lacking. Moreover, each enterprise generally organizes only its own history data, so there is no enterprise history query platform on which the historical events of all enterprises can be viewed in one place.

Disclosure of Invention

The embodiments of the invention provide a data storage method, a data storage device, computer equipment and a storage medium, aiming to solve the problem in the prior art that enterprises organize their data largely by manual text processing, without automatic collection and arrangement, which makes data organization inefficient.

In a first aspect, an embodiment of the present invention provides a data storage method, including:

Crawling a corresponding source data set according to the received enterprise name, and preprocessing the source data set to obtain processed data corresponding to a plurality of text data contained in the source data set one by one; wherein each processed data corresponds to an enterprise history information list;

Performing duplication removal on the processed data which are contained in the source data and correspond to the text data one by one to obtain a duplication-removed data set corresponding to the source data; and

Acquiring enterprise history information lists included in the duplicate-removed data set, and correspondingly adding each enterprise information list to a time node corresponding to a time axis or a calendar according to the occurrence time of an enterprise event in the included text and the corresponding enterprise history information list to obtain development track data corresponding to an enterprise name; the names of the time axis or the calendars are generated by the enterprise names corresponding to the enterprise information lists, and the development track data are data in the form of the time axis or the calendars added with the enterprise history information list;

And displaying the development track data through a time axis or a calendar.

In a second aspect, an embodiment of the present invention provides a data storage device, including:

the source data crawling unit is used for crawling a corresponding source data set according to the received enterprise name, preprocessing the source data set, and obtaining processed data corresponding to a plurality of text data included in the source data set one by one; wherein each processed data corresponds to an enterprise history information list;

the de-duplication unit is used for de-duplication the processed data which are contained in the source data and correspond to the text data one by one to obtain a de-duplicated data set which corresponds to the source data; and

The track data acquisition unit is used for acquiring enterprise history information lists included in the duplicate-removed data set, and correspondingly adding each enterprise information list to a time node corresponding to a time axis or a calendar according to the occurrence time of an enterprise event in the included text and the corresponding enterprise history information list to obtain development track data corresponding to an enterprise name; the names of the time axis or the calendars are generated by the enterprise names corresponding to the enterprise information lists, and the development track data are data in the form of the time axis or the calendars added with the enterprise history information list;

And the data display unit is used for displaying the development track data through a time axis or a calendar.

In a third aspect, an embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the data storage method described in the first aspect when the processor executes the computer program.

In a fourth aspect, an embodiment of the present invention further provides a computer readable storage medium, where the computer readable storage medium stores a computer program, where the computer program when executed by a processor causes the processor to perform the data storage method according to the first aspect.

The embodiments of the invention provide a data storage method, a data storage device, computer equipment and a storage medium. The method comprises: crawling a corresponding source data set according to a received enterprise name, and preprocessing the source data set to obtain processed data corresponding one by one to a plurality of text data contained in the source data set, wherein each processed data corresponds to an enterprise history information list; de-duplicating the processed data which are contained in the source data and correspond to the text data one by one, to obtain a de-duplicated data set corresponding to the source data; acquiring the enterprise history information lists included in the de-duplicated data set, and adding each enterprise information list to the corresponding time node of a time axis or calendar according to the occurrence time of the enterprise event in the included text and the corresponding enterprise history information list, so as to obtain development track data corresponding to the enterprise name; and displaying the development track data through the time axis or calendar. The method uses data crawling to collect enterprise-related information from the network automatically, extracts the key information from it, and sorts that key information by time to obtain the development track data of the enterprise.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic diagram of an application scenario of a data storage method according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart of a data storage method according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of another flow chart of a data storage method according to an embodiment of the present invention;

FIG. 4 is a schematic sub-flowchart of a data storage method according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of another sub-flowchart of a data storage method according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of another sub-flowchart of a data storage method according to an embodiment of the present invention;

FIG. 7 is a schematic block diagram of a data storage device provided by an embodiment of the present invention;

FIG. 8 is another schematic block diagram of a data storage device provided in an embodiment of the present invention;

FIG. 9 is a schematic block diagram of a subunit of a data storage device according to an embodiment of the present invention;

FIG. 10 is a schematic block diagram of another subunit of a data storage device according to an embodiment of the present invention;

FIG. 11 is a schematic block diagram of another subunit of a data storage device according to an embodiment of the present invention;

FIG. 12 is a schematic block diagram of a computer device according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.

Referring to fig. 1 and fig. 2, fig. 1 is a schematic application scenario diagram of a data storage method according to an embodiment of the present invention, and fig. 2 is a flowchart of the data storage method according to an embodiment of the present invention, where the data storage method is applied to a server, and the method is executed by application software installed in the server.

As shown in fig. 2, the method includes steps S110 to S140.

S110, crawling a corresponding source data set according to the received enterprise name, and preprocessing the source data set to obtain processed data corresponding to a plurality of text data contained in the source data set one by one; wherein each processed data corresponds to a list of enterprise history information.

In this embodiment, in order to obtain the development tracks of one or more enterprises, a crawler tool may crawl a large amount of web page content related to those enterprises. For example, when web page content related to enterprise A needs to be acquired, a list of web pages to be crawled that relate to enterprise A is configured for a crawler tool deployed in the server, and the crawler tool fetches the content of those pages from the web servers hosting them to form the source data set. The source data set is then preprocessed to obtain historical text data related to the development of the enterprise. Specifically, a title keyword list and a text keyword list can be extracted from each piece of text data crawled from the web pages.
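As an illustration only, the crawling step could be sketched as follows. The function name `crawl_source_data_set`, the record fields `url`, `title` and `body`, and the injected `fetch` callable are all assumptions made for this sketch; the embodiment does not name a particular crawler tool. Passing `fetch` in as a parameter keeps the sketch independent of any HTTP library.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, Iterable, List


class _TitleAndTextParser(HTMLParser):
    """Collects the <title> text and the remaining visible text of a page."""

    def __init__(self) -> None:
        super().__init__()
        self.title_parts: List[str] = []
        self.body_parts: List[str] = []
        self._in_title = False
        self._skip_depth = 0  # greater than 0 while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return
        target = self.title_parts if self._in_title else self.body_parts
        target.append(text)


def crawl_source_data_set(
    urls: Iterable[str], fetch: Callable[[str], str]
) -> List[Dict[str, str]]:
    """Fetch every page in the to-be-crawled list and keep its title and text.

    `fetch` is a hypothetical callable that returns the HTML of a URL.
    """
    source_data_set = []
    for url in urls:
        parser = _TitleAndTextParser()
        parser.feed(fetch(url))
        parser.close()
        source_data_set.append(
            {
                "url": url,
                "title": " ".join(parser.title_parts),
                "body": " ".join(parser.body_parts),
            }
        )
    return source_data_set
```

In a real deployment `fetch` would wrap an HTTP client; in a test it can simply look pages up in a dictionary.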

In one embodiment, as shown in fig. 4, step S110 includes:

S111, extracting keyword information from the title and the text of each text data in a plurality of text data included in the source data set through a word frequency-inverse text frequency index model respectively to obtain a title keyword list and a text keyword list corresponding to each text data;

S112, identifying the publishing time of each text data and the occurrence time of enterprise events in a plurality of text data included in the source data set to obtain text key time information corresponding to each text data;

And S113, packaging and combining a title keyword list, a text keyword list and text key time information corresponding to each text data in the source data set to obtain an enterprise history information list corresponding to each text data so as to obtain processed data corresponding to each source data.

In this embodiment, when keyword information extraction is performed on the title and the body of each text data in the plurality of text data included in the source data set by using a TF-IDF model (i.e., term frequency-inverse document frequency, which represents a word frequency-inverse text frequency index model), each text data may be simplified and expressed as a corresponding keyword list.
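A minimal sketch of TF-IDF keyword extraction is given below. It assumes simple alphanumeric tokenization (Chinese text would need a word segmenter) and a smoothed IDF, which is one common variant; the embodiment does not specify either. The names `tokenize`, `tfidf_keywords` and `top_k` are hypothetical. The same function can be applied once to the titles and once to the bodies to obtain the two keyword lists.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    # Assumption: lower-cased alphanumeric tokens. Chinese text would need
    # a word segmenter instead.
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_keywords(documents: List[str], top_k: int = 3) -> List[List[str]]:
    """Return the top_k highest-scoring TF-IDF terms for each document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: the number of documents each term appears in.
    doc_freq: Counter = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    keyword_lists = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores = {}
        for term, count in counts.items():
            tf = count / total
            # Smoothed IDF (assumed variant): rarer terms score higher.
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
            scores[term] = tf * idf
        # Sort by score, breaking ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keyword_lists.append(ranked[:top_k])
    return keyword_lists
```

Terms that occur in every document, such as the enterprise name itself, receive the lowest IDF and so tend to drop out of the keyword lists.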

Then, the publishing time of each text data and the occurrence time of the enterprise event are identified for the plurality of text data included in the source data set. The publishing time is obtained by scanning the text below the title (on the web, the publishing time of a news item generally appears below its title). The occurrence time of the enterprise event is obtained by scanning the body of the text data and judging whether it contains a time point of the form "X1 year X2 month X3 day X4 hour X5 minute"; if so, that text is recognized to obtain the occurrence time of the enterprise event.
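The scan for event times can be sketched with a regular expression. This sketch assumes the time points are written in the Chinese style `YYYY年M月D日H时M分`, with the hour and minute optional; the pattern and the name `find_event_times` are assumptions, not part of the embodiment.

```python
import re
from datetime import datetime
from typing import List

# Assumed pattern: year, month and day are required; hour and minute optional.
_EVENT_TIME = re.compile(
    r"(\d{4})年(\d{1,2})月(\d{1,2})日(?:(\d{1,2})时(?:(\d{1,2})分)?)?"
)


def find_event_times(body: str) -> List[datetime]:
    """Return every well-formed time point found in the body text."""
    times = []
    for match in _EVENT_TIME.finditer(body):
        year, month, day, hour, minute = match.groups()
        try:
            times.append(
                datetime(
                    int(year), int(month), int(day),
                    int(hour or 0), int(minute or 0),
                )
            )
        except ValueError:
            # Skip impossible dates such as month 13.
            continue
    return times
```

A body that contains no such time point yields an empty list, in which case a caller might fall back to the publishing time.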

Since each text data is simplified to be a combination of the title keyword list, the text keyword list and the text key time information, each text data corresponding to each source data can be simplified to be represented by the enterprise history information list, that is, the processed data corresponding to each source data can be represented by the corresponding enterprise history information list.
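One possible in-memory shape for an enterprise history information list is sketched below. The class and field names (`EnterpriseHistoryInfo`, `KeyTimeInfo`, `title_keywords`, `body_keywords`, `publish_time`, `event_time`) are hypothetical; the embodiment only states that the three parts are packaged together.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class KeyTimeInfo:
    """Text key time information: publishing time and event occurrence time."""
    publish_time: Optional[datetime]
    event_time: Optional[datetime]


@dataclass(frozen=True)
class EnterpriseHistoryInfo:
    """Processed data for one piece of text data."""
    title_keywords: Tuple[str, ...]
    body_keywords: Tuple[str, ...]
    key_time: KeyTimeInfo


def package(
    title_keywords: Iterable[str],
    body_keywords: Iterable[str],
    publish_time: Optional[datetime],
    event_time: Optional[datetime],
) -> EnterpriseHistoryInfo:
    """Combine the keyword lists and time information into one record."""
    return EnterpriseHistoryInfo(
        tuple(title_keywords),
        tuple(body_keywords),
        KeyTimeInfo(publish_time, event_time),
    )
```

Using immutable tuples makes each record hashable, which is convenient when records are later compared for duplicates.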

And S120, performing de-duplication on the processed data which are contained in the source data and correspond to the text data one by one, so as to obtain a de-duplicated data set corresponding to the source data.

In this embodiment, duplicate data is checked for because the crawler may crawl news content from several web pages that all describe the same enterprise event, for example when one news item has been reprinted by multiple websites. Identical web page content yields identical processed data after preprocessing, so among several identical processed data only one needs to be kept and the rest are deleted as duplicates, which effectively prevents repeated enterprise event information from being obtained.

In one embodiment, as shown in fig. 5, step S120 includes:

S121, acquiring each piece of processed data, and judging whether repeated processed data exists in the processed data corresponding to a plurality of pieces of text data included in a source data set one by one according to a title keyword list and a text keyword list included in each piece of processed data;

And S122, if repeated processed data exists among the processed data corresponding one by one to the text data included in the source data set, arbitrarily selecting one of the identical processed data to keep and deleting the others, so as to obtain the de-duplicated data set corresponding to the source data.

In this embodiment, if the processed data corresponding one by one to the text data in the source data set includes repeated processed data, the crawler has crawled news content from several web pages that all describe the same enterprise event. In that case only the processed data of one of those web pages needs to be kept, and the other redundant processed data are deleted, so as to obtain the de-duplicated data set corresponding to the source data. If the processed data does not include repeated processed data, the source data set is retained as the de-duplicated data set. This de-duplication effectively reduces the size of the data set.
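The de-duplication rule can be sketched as follows, assuming each processed record is a dictionary with hypothetical `title_keywords` and `body_keywords` fields and that two records are duplicates when both keyword lists are equal. The embodiment allows any one of the duplicates to be kept; this sketch keeps the first one encountered.

```python
from typing import Dict, List, Set, Tuple


def deduplicate(processed: List[Dict]) -> List[Dict]:
    """Keep one record per distinct (title keywords, body keywords) pair."""
    seen: Set[Tuple[Tuple[str, ...], Tuple[str, ...]]] = set()
    deduplicated = []
    for record in processed:
        key = (tuple(record["title_keywords"]), tuple(record["body_keywords"]))
        if key in seen:
            continue  # an identical record was already kept
        seen.add(key)
        deduplicated.append(record)
    return deduplicated
```

When the input contains no duplicates, the function returns a list with the same records in the same order, matching the case where the source data set is retained unchanged.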

S130, acquiring enterprise history information lists included in the de-duplicated data set, and correspondingly adding each enterprise information list to a time node corresponding to a time axis or a calendar according to the occurrence time of an enterprise event in the included text and the corresponding enterprise history information list to obtain development track data corresponding to an enterprise name; wherein the names of the time axis or the calendar are generated by the enterprise names corresponding to the enterprise information lists, and the development track data is data in the form of the time axis or the calendar added with the enterprise history information list.

In this embodiment, because duplicate enterprise history information lists are deleted from the deduplicated dataset, and each enterprise history information list includes enterprise event occurrence times, each enterprise history information list may be ordered according to the enterprise event occurrence times, and each enterprise history information list may be added to a corresponding time node according to the enterprise event occurrence times, so as to obtain development track data of an enterprise.

Specifically, a named time axis or calendar is generated from the enterprise name corresponding to the enterprise information lists, and each enterprise history information list is then added to the time node of the time axis or calendar that corresponds to the occurrence time of its enterprise event. Such time nodes are only highlighted on the time axis or calendar; when a time node is clicked, the one or more enterprise history information lists for that day pop up.

In one embodiment, as shown in fig. 6, step S130 includes:

S131, sorting the enterprise history information lists in a descending order according to the occurrence time of the enterprise events included in the enterprise information lists to obtain a sorted enterprise history information list set;

And S132, sequentially adding each enterprise event occurrence time in the ordered enterprise history information list set to a time node corresponding to a preset time axis or calendar according to the descending order of the enterprise event occurrence time so as to obtain development track data corresponding to the enterprise name.

In this embodiment, the enterprise history information lists may be ordered according to occurrence time of the enterprise event, and each enterprise history information list is added to a time axis or a date corresponding to a calendar according to occurrence time of the enterprise event, so as to obtain development track data using the time axis or the calendar as a carrier, and a date may be selected to view the enterprise history information list and specific text content thereof.
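The sorting and mapping steps might look like the sketch below. It assumes each record is a dictionary with a hypothetical `event_time` field, that one time node corresponds to one calendar day, and that the timeline name is formed by appending " timeline" to the enterprise name; none of these details are fixed by the embodiment.

```python
from datetime import date, datetime
from typing import Dict, List


def build_development_track(enterprise_name: str, records: List[Dict]) -> Dict:
    """Sort records by event time (newest first) and group them by day."""
    ordered = sorted(records, key=lambda r: r["event_time"], reverse=True)

    # Dictionaries preserve insertion order, so nodes stay in descending order.
    nodes: Dict[date, List[Dict]] = {}
    for record in ordered:
        nodes.setdefault(record["event_time"].date(), []).append(record)

    return {"name": f"{enterprise_name} timeline", "nodes": nodes}
```

Several records that fall on the same day share one time node, which is what allows a single click on that node to show all of them.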

And S140, displaying the development track data through a time axis or a calendar.

In this embodiment, after each enterprise history information list is added to a date corresponding to a time axis or a calendar according to the occurrence time of an enterprise event, the development track data is displayed through the time axis or the calendar. In order to save display space, generally, only a time axis named by an enterprise name or a calendar is initially displayed, a time node highlight identifier of an enterprise history information list is added, and after a user clicks the time node, a corresponding enterprise history information list is displayed.
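A text-only sketch of this display behavior is shown below: the collapsed view lists only the highlighted time nodes, and a separate call stands in for the user clicking a node. The function names, the `[*]` highlight marker and the newest-first ordering are assumptions made for illustration; an actual implementation would render a graphical time axis or calendar.

```python
from datetime import date
from typing import Dict, List


def render_collapsed(name: str, nodes: Dict[date, List[str]]) -> str:
    """Show the timeline name and one highlighted line per time node."""
    lines = [name]
    for day in sorted(nodes, reverse=True):  # newest first (assumed order)
        lines.append(f"  [*] {day.isoformat()}")
    return "\n".join(lines)


def expand_node(nodes: Dict[date, List[str]], day: date) -> List[str]:
    """Simulate a click: return the records attached to the chosen day."""
    return list(nodes.get(day, []))
```

Keeping the records out of the collapsed view is what saves display space; they are only materialized when a node is expanded.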

In one embodiment, as shown in fig. 3, after step S140 the method further includes:

And S150, if the entered search keywords are detected, comparing the search keywords with an enterprise historical information list, and acquiring an enterprise historical information list with the pearson similarity with the search keywords exceeding a preset similarity threshold value to serve as a search pushing result.

In this embodiment, after each source data in the source data set has been preprocessed and de-duplicated, enterprise history information lists corresponding one by one to the text data are obtained, and these lists are mapped to the corresponding time nodes in chronological order to form the development track data of the enterprise. Because each enterprise history information list includes a title keyword list, a text keyword list and text key time information, an entered search keyword can be compared with the keywords in each enterprise history information list, and the enterprise history information lists whose keywords are similar to the search keyword are selected as the search pushing result.

Specifically, an enterprise history information list with the pearson similarity with the search keyword exceeding a preset similarity threshold is obtained to serve as a search pushing result. When the pearson similarity between each keyword in the enterprise history information list and the search keyword is calculated, each keyword in the enterprise history information list and the search keyword can be converted into a Word vector through a Word2Vec model (the Word2Vec model is a model for converting words into Word vectors), and then the pearson similarity between the Word vector corresponding to each keyword and the Word vector corresponding to the search keyword is calculated.

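The Pearson correlation coefficient between any two vectors X and Y is calculated by the following formula:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$$

wherein E represents a mathematical expectation, μ_X and μ_Y are the means of X and Y, and σ_X and σ_Y are their standard deviations;

the range of values for ρ_{X,Y} is [-1, 1], with ρ_{X,Y} being closer to 1 indicating a higher degree of similarity for the two vectors and ρ_{X,Y} being closer to 0 indicating a lower degree of similarity for the two vectors.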
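The search step can be sketched as below. The word vectors are taken from a plain dictionary that stands in for a trained Word2Vec model, and a record is treated as a hit when its best-matching keyword exceeds the threshold; both choices, along with the names `pearson`, `search`, `title_keywords` and `body_keywords`, are assumptions, since the embodiment does not say how per-keyword similarities are combined.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("vectors must be non-empty and of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant vector
    return cov / math.sqrt(var_x * var_y)


def search(
    query_vector: Sequence[float],
    records: List[Dict],
    vectors: Dict[str, Sequence[float]],
    threshold: float,
) -> List[Dict]:
    """Return the records whose best keyword similarity exceeds threshold.

    `vectors` is a hypothetical keyword-to-vector table standing in for a
    trained Word2Vec model.
    """
    hits = []
    for record in records:
        keywords = list(record["title_keywords"]) + list(record["body_keywords"])
        best = max(
            (pearson(query_vector, vectors[k]) for k in keywords if k in vectors),
            default=0.0,
        )
        if best > threshold:
            hits.append(record)
    return hits
```

Keywords missing from the vector table are skipped, so a record with no known keywords never matches.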

The method realizes that the enterprise related information is automatically crawled from the network, and after the key information of the enterprise related information is extracted, all the key information is ordered according to time, so that the development track data of the enterprise is obtained.

The embodiment of the invention also provides a data storage device which is used for executing any embodiment of the data storage method. In particular, referring to fig. 7, fig. 7 is a schematic block diagram of a data storage device according to an embodiment of the present invention. The data storage device 100 may be configured in a server.

As shown in fig. 7, the data storage device 100 includes a source data crawling unit 110, a deduplication unit 120, a trajectory data acquisition unit 130, and a data display unit 140.

A source data crawling unit 110, configured to crawl a corresponding source data set according to the received enterprise name, and pre-process the source data set to obtain processed data corresponding to a plurality of text data included in the source data set one-to-one; wherein each processed data corresponds to a list of enterprise history information.

In this embodiment, in order to obtain the development tracks of one or more enterprises, a crawler tool may crawl a large amount of web page content related to those enterprises. For example, when web page content related to enterprise A needs to be acquired, a list of web pages to be crawled that relate to enterprise A is configured for a crawler tool deployed in the server, and the crawler tool fetches the content of those pages from the web servers hosting them to form the source data set. The source data set is then preprocessed to obtain historical text data related to the development of the enterprise. Specifically, a title keyword list and a text keyword list can be extracted from each piece of text data crawled from the web pages.

In one embodiment, as shown in FIG. 9, the source data crawling unit 110 comprises:

A keyword extraction unit 111, configured to extract keyword information from a title and a text of each text data in the plurality of text data included in the source data set through a word frequency-inverse text frequency index model, to obtain a title keyword list and a text keyword list corresponding to each text data;

a time information identifying unit 112, configured to identify a publishing time of each text data in the plurality of text data included in the source data set and an occurrence time of an enterprise event, so as to obtain text key time information corresponding to each text data;

And the data combination unit 113 is configured to encapsulate and combine the title keyword list, the text keyword list, and the text key time information corresponding to each text data in the source data set, so as to obtain an enterprise history information list corresponding to each text data, so as to obtain processed data corresponding to each source data.

And the deduplication unit 120 is configured to deduplicate the processed data that corresponds to the plurality of text data included in the source data one by one, so as to obtain a deduplicated dataset that corresponds to the source data.

In one embodiment, as shown in fig. 10, the deduplication unit 120 includes:

A data comparison unit 121, configured to obtain each piece of processed data, and determine whether there is repeated processed data corresponding to the plurality of pieces of text data included in the source data set one by one according to the title keyword list and the text keyword list included in each piece of processed data;

the data deleting unit 122 is configured to, if repeated processed data exists among the processed data corresponding one by one to the plurality of text data included in the source data set, arbitrarily select one of the identical processed data to keep and delete the others, so as to obtain the de-duplicated data set corresponding to the source data.

In this embodiment, if the processed data corresponding one by one to the text data in the source data set includes repeated processed data, the crawler has crawled news content from several web pages that all describe the same enterprise event. In that case only the processed data of one of those web pages needs to be kept, and the other redundant processed data are deleted, so as to obtain the de-duplicated data set corresponding to the source data. This de-duplication effectively reduces the size of the data set.

The track data obtaining unit 130 is configured to obtain an enterprise history information list included in the deduplicated dataset, and correspondingly add each enterprise information list to a time node corresponding to a time axis or a calendar according to the occurrence time of an enterprise event in the included text and the corresponding enterprise history information list, so as to obtain development track data corresponding to an enterprise name; wherein the names of the time axis or the calendar are generated by the enterprise names corresponding to the enterprise information lists, and the development track data is data in the form of the time axis or the calendar added with the enterprise history information list.

In one embodiment, as shown in fig. 11, the trajectory data acquisition unit 130 includes:

A descending order sorting unit 131, configured to sort the enterprise history information lists in descending order according to the occurrence time of the enterprise event included in each enterprise information list, so as to obtain a sorted enterprise history information list set;

the track data mapping unit 132 is configured to sequentially add each of the ordered enterprise history information list sets to a time node corresponding to a preset time axis or calendar in descending order of occurrence time of enterprise events, so as to obtain development track data corresponding to the enterprise name.

And a data display unit 140, configured to display the development track data through a time axis or a calendar.

In one embodiment, as shown in FIG. 8, the data storage device 100 further comprises:

And the search unit 150 is configured to compare the entered search keyword with the enterprise history information list if the entered search keyword is detected, and obtain an enterprise history information list with pearson similarity with the search keyword exceeding a preset similarity threshold, as a search pushing result.

The device realizes that the enterprise related information is automatically crawled from the network, and after the key information of the enterprise related information is extracted, all the key information is ordered according to time, so that the development track data of the enterprise is obtained.

The data storage device described above may be implemented in the form of a computer program which is executable on a computer device as shown in fig. 12.

Referring to fig. 12, fig. 12 is a schematic block diagram of a computer device according to an embodiment of the present invention. The computer device 500 is a server, and the server may be a stand-alone server or a server cluster formed by a plurality of servers.

With reference to FIG. 12, the computer device 500 includes a processor 502, memory, and a network interface 505 connected by a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.

The non-volatile storage medium 503 may store an operating system 5031 and a computer program 5032. The computer program 5032, when executed, may cause the processor 502 to perform a data storage method.

The processor 502 is used to provide computing and control capabilities to support the operation of the overall computer device 500.

The internal memory 504 provides an environment for the execution of a computer program 5032 in the non-volatile storage medium 503, which computer program 5032, when executed by the processor 502, causes the processor 502 to perform a data storage method.

The network interface 505 is used for network communication, such as providing for transmission of data information, etc. It will be appreciated by those skilled in the art that the structure shown in FIG. 12 is merely a block diagram of some of the structures associated with the present inventive arrangements and does not constitute a limitation of the computer device 500 to which the present inventive arrangements may be applied, and that a particular computer device 500 may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.

Wherein the processor 502 is configured to execute a computer program 5032 stored in a memory to perform the following functions: crawling a corresponding source data set according to the received enterprise name, and preprocessing the source data set to obtain processed data corresponding to a plurality of text data contained in the source data set one by one; wherein each processed data corresponds to an enterprise history information list; performing duplication removal on the processed data which are contained in the source data and correspond to the text data one by one to obtain a duplication-removed data set corresponding to the source data; acquiring enterprise history information lists included in the duplicate-removed data set, and correspondingly adding each enterprise information list to a time node corresponding to a time axis or a calendar according to the occurrence time of an enterprise event in the included text and the corresponding enterprise history information list to obtain development track data corresponding to an enterprise name; the names of the time axis or the calendars are generated by the enterprise names corresponding to the enterprise information lists, and the development track data are data in the form of the time axis or the calendars added with the enterprise history information list; and displaying the development track data through a time axis or a calendar.

In an embodiment, after performing the steps of acquiring the enterprise history information lists included in the de-duplicated data set, correspondingly adding each enterprise information list to a time node corresponding to a time axis or a calendar according to the occurrence time of the enterprise event in the included text and the corresponding enterprise history information list so as to obtain development track data corresponding to the enterprise name, and displaying the development track data through a time axis or a calendar, the processor 502 further performs the following operation: if an entered search keyword is detected, comparing the search keyword with the enterprise history information lists, and acquiring, as a search push result, each enterprise history information list whose Pearson similarity with the search keyword exceeds a preset similarity threshold.

In one embodiment, when performing the step of preprocessing the source data set to obtain processed data in one-to-one correspondence with the plurality of text data included in the source data set, the processor 502 performs the following operations: extracting keyword information from the title and the body text of each text data in the source data set, respectively, through a term frequency-inverse document frequency (TF-IDF) model, to obtain a title keyword list and a text keyword list corresponding to each text data; identifying the publication time of each text data and the occurrence time of the enterprise event in each text data included in the source data set, to obtain text key time information corresponding to each text data; and packaging and combining the title keyword list, the text keyword list and the text key time information corresponding to each piece of text data in the source data set to obtain an enterprise history information list corresponding to each piece of text data, so as to obtain processed data corresponding to each piece of source data.
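The preprocessing operations above can be sketched roughly as follows. This is a minimal illustration, not the implementation of the embodiment: it assumes whitespace-delimited text (Chinese text would first need word segmentation), a smoothed TF-IDF weighting, and event dates written as `YYYY-MM-DD`. The function names and the field names `title`, `body`, `published`, `publish_time` and `event_time` are hypothetical.

```python
import math
import re
from collections import Counter

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")  # assumption: ISO-style dates only
WORD_RE = re.compile(r"[^\W\d_]+")          # runs of letters


def tokenize(text):
    return WORD_RE.findall(text.lower())


def tfidf_keywords(corpus, k):
    """Return the k highest-scoring TF-IDF terms for each text in the corpus."""
    token_lists = [tokenize(text) for text in corpus]
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))

    keyword_lists = []
    for tokens in token_lists:
        counts = Counter(tokens)
        scores = {
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keyword_lists.append(ranked[:k])
    return keyword_lists


def preprocess(source_data_set, k=2):
    """Build one enterprise history information list per text data."""
    title_kw = tfidf_keywords([d["title"] for d in source_data_set], k)
    text_kw = tfidf_keywords([d["body"] for d in source_data_set], k)
    processed = []
    for doc, titles, texts in zip(source_data_set, title_kw, text_kw):
        match = DATE_RE.search(doc["body"])
        processed.append({
            "title_keywords": titles,
            "text_keywords": texts,
            "publish_time": doc["published"],
            "event_time": match.group(0) if match else None,
        })
    return processed
```

Titles and body texts are scored as two separate corpora here, which is one reading of extracting keywords from them "respectively".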

In one embodiment, when performing the step of de-duplicating the processed data in one-to-one correspondence with the plurality of text data included in the source data to obtain the de-duplicated data set corresponding to the source data, the processor 502 performs the following operations: acquiring each piece of processed data, and judging, according to the title keyword list and the text keyword list included in each piece of processed data, whether the processed data in one-to-one correspondence with the plurality of text data included in the source data set contain duplicates; if duplicate processed data exist, retaining any one selected item of the duplicate processed data and deleting the items not selected for retention, so as to obtain the de-duplicated data set corresponding to the source data.
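A minimal sketch of this de-duplication step is given below. The text does not define when two items count as duplicates, so the sketch assumes they do when their title keyword lists and text keyword lists contain the same terms regardless of order, and it keeps the first item encountered as the "selected" one. The name `deduplicate` and the field names are hypothetical.

```python
def deduplicate(processed):
    """Keep one item per distinct (title keywords, text keywords) combination."""
    seen = set()
    kept = []
    for item in processed:
        # Assumption: keyword order is irrelevant when judging duplicates.
        key = (
            frozenset(item["title_keywords"]),
            frozenset(item["text_keywords"]),
        )
        if key in seen:
            continue  # duplicate of an item already retained; drop it
        seen.add(key)
        kept.append(item)
    return kept
```

A looser criterion, such as a keyword-overlap threshold, could be substituted for the exact-match key if near-duplicates should also be merged.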

In one embodiment, when executing the step of adding each enterprise information list to a time node corresponding to a time axis or a calendar according to the occurrence time of an enterprise event in the included text and the corresponding enterprise history information list, the processor 502 executes the following operations: according to the occurrence time of the enterprise event included in each enterprise information list, ordering each enterprise history information list in a descending order to obtain an ordered enterprise history information list set; and sequentially adding each of the ordered enterprise history information list sets to a time node corresponding to a preset time axis or calendar according to the descending order of the occurrence time of the enterprise event so as to obtain development track data corresponding to the enterprise name.
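The sorting and mapping operations can be sketched as follows, assuming each enterprise history information list carries its event occurrence time as a `datetime.date` in a hypothetical `event_time` field. The function name `build_track`, the dictionary layout and the generated name string are illustrative only.

```python
from datetime import date


def build_track(enterprise_name, history_lists):
    """Sort history lists by event time (newest first) and attach them to time nodes."""
    ordered = sorted(history_lists, key=lambda h: h["event_time"], reverse=True)

    # Each distinct event time becomes one time node. A dict keeps insertion
    # order, so the nodes are stored in descending order of event time.
    nodes = {}
    for item in ordered:
        nodes.setdefault(item["event_time"], []).append(item)

    # The time axis or calendar is named after the enterprise.
    return {"name": f"{enterprise_name} development track", "nodes": nodes}


track = build_track("Acme", [
    {"event_time": date(2018, 3, 1), "title_keywords": ["acquires", "beta"]},
    {"event_time": date(2019, 1, 15), "title_keywords": ["opens", "office"]},
])
for node_time, items in track["nodes"].items():
    print(node_time.isoformat(), [i["title_keywords"] for i in items])
```

Only time nodes that actually received a history list exist in `nodes`, so a display layer could highlight exactly those nodes.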

Those skilled in the art will appreciate that the embodiment of the computer device shown in FIG. 12 does not limit the specific construction of the computer device; in other embodiments, the computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components. For example, in some embodiments the computer device may include only a memory and a processor, in which case the structure and function of the memory and the processor are consistent with the embodiment shown in FIG. 12 and are not described again.

It should be appreciated that in embodiments of the present invention, the processor 502 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.

In another embodiment of the invention, a computer-readable storage medium is provided. The computer readable storage medium may be a non-volatile computer readable storage medium. The computer readable storage medium stores a computer program, wherein the computer program when executed by a processor performs the steps of: crawling a corresponding source data set according to the received enterprise name, and preprocessing the source data set to obtain processed data corresponding to a plurality of text data contained in the source data set one by one; wherein each processed data corresponds to an enterprise history information list; performing duplication removal on the processed data which are contained in the source data and correspond to the text data one by one to obtain a duplication-removed data set corresponding to the source data; acquiring enterprise history information lists included in the duplicate-removed data set, and correspondingly adding each enterprise information list to a time node corresponding to a time axis or a calendar according to the occurrence time of an enterprise event in the included text and the corresponding enterprise history information list to obtain development track data corresponding to an enterprise name; the names of the time axis or the calendars are generated by the enterprise names corresponding to the enterprise information lists, and the development track data are data in the form of the time axis or the calendars added with the enterprise history information list; and displaying the development track data through a time axis or a calendar.

In an embodiment, after the steps of acquiring the enterprise history information lists included in the de-duplicated data set, correspondingly adding each enterprise information list to a time node corresponding to a time axis or a calendar according to the occurrence time of the enterprise event in the included text and the corresponding enterprise history information list so as to obtain development track data corresponding to the enterprise name, and displaying the development track data through a time axis or a calendar, the method further comprises: if an entered search keyword is detected, comparing the search keyword with the enterprise history information lists, and acquiring, as a search push result, each enterprise history information list whose Pearson similarity with the search keyword exceeds a preset similarity threshold.

In an embodiment, preprocessing the source data set to obtain processed data in one-to-one correspondence with the plurality of text data included in the source data set includes: extracting keyword information from the title and the body text of each text data in the source data set, respectively, through a term frequency-inverse document frequency (TF-IDF) model, to obtain a title keyword list and a text keyword list corresponding to each text data; identifying the publication time of each text data and the occurrence time of the enterprise event in each text data included in the source data set, to obtain text key time information corresponding to each text data; and packaging and combining the title keyword list, the text keyword list and the text key time information corresponding to each piece of text data in the source data set to obtain an enterprise history information list corresponding to each piece of text data, so as to obtain processed data corresponding to each piece of source data.

In an embodiment, the performing deduplication on the processed data corresponding to the plurality of text data included in the source data one by one to obtain a deduplicated data set corresponding to the source data includes: acquiring each piece of processed data, and judging whether the processed data corresponding to a plurality of pieces of text data included in the source data set one by one has repeated processed data according to a title keyword list and a text keyword list included in each piece of processed data; if the processed data corresponding to the text data included in the source data set one by one has repeated processed data, any selected group of the processed data is stored and the processed data which is not selected to be stored is deleted, so that the de-duplicated data set corresponding to the source data is obtained.

In an embodiment, the adding each enterprise information list to a time node corresponding to a time axis or a calendar according to the occurrence time of the enterprise event in the included text and the corresponding enterprise history information list, to obtain development track data corresponding to the enterprise name includes: according to the occurrence time of the enterprise event included in each enterprise information list, ordering each enterprise history information list in a descending order to obtain an ordered enterprise history information list set; and sequentially adding each of the ordered enterprise history information list sets to a time node corresponding to a preset time axis or calendar according to the descending order of the occurrence time of the enterprise event so as to obtain development track data corresponding to the enterprise name.

It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working procedures of the apparatus, device and unit described above may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein. Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division of the units is merely a division by logical function, and other divisions are possible in actual implementation; units having the same function may be integrated into one unit, multiple units or components may be combined or integrated into another system, and some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via some interfaces, devices or units, and may be electrical, mechanical or of another form.

The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiments of the present invention.

In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units may be stored in a storage medium if implemented in the form of software functional units and sold or used as stand-alone products. Based on such understanding, the technical solution of the present invention, in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product that is stored in a storage medium and comprises several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disk.

While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the scope of the invention. Therefore, the protection scope of the invention is defined by the protection scope of the claims.

Claims

1. A method of data storage, comprising:

performing duplication removal on the processed data which are contained in the source data and correspond to the text data one by one to obtain a duplication-removed data set corresponding to the source data;

Acquiring enterprise history information lists included in the duplicate-removed data set, and correspondingly adding each enterprise information list to a time node corresponding to a time axis or a calendar according to the occurrence time of an enterprise event in the included text and the corresponding enterprise history information list to obtain development track data corresponding to an enterprise name; the names of the time axis or the calendars are generated by the enterprise names corresponding to the enterprise information lists, and the development track data are data in the form of the time axis or the calendars added with the enterprise history information list; and

Displaying the development track data through a time axis or a calendar;

the step of adding each enterprise information list to a time node corresponding to a time axis or a calendar according to the enterprise event occurrence time and the corresponding enterprise history information list in the included text to obtain development track data corresponding to the enterprise name, including:

According to the occurrence time of the enterprise event included in each enterprise information list, ordering each enterprise history information list in a descending order to obtain an ordered enterprise history information list set;

sequentially adding each of the ordered enterprise history information list sets to a time node corresponding to a preset time axis or calendar according to the descending order of the occurrence time of the enterprise event so as to obtain development track data corresponding to the enterprise name;

and only highlighting the time node corresponding to the time axis or the calendar added with the enterprise history information list.

2. The data storage method according to claim 1, wherein, after the steps of acquiring the enterprise history information lists included in the de-duplicated data set, correspondingly adding each enterprise information list to a time node corresponding to a time axis or a calendar according to the occurrence time of the enterprise event in the included text and the corresponding enterprise history information list to obtain development track data corresponding to the enterprise name, and displaying the development track data through a time axis or a calendar, the method further comprises:

if the entered search keywords are detected, comparing the search keywords with an enterprise historical information list, and acquiring an enterprise historical information list with the pearson similarity with the search keywords exceeding a preset similarity threshold value to serve as a search pushing result.

3. The data storage method according to claim 1, wherein preprocessing the source data set to obtain processed data corresponding to a plurality of text data included in the source data set one-to-one, includes:

extracting keyword information from the title and the body text of each text data in the plurality of text data included in the source data set, respectively, through a term frequency-inverse document frequency (TF-IDF) model, to obtain a title keyword list and a text keyword list corresponding to each text data;

identifying the publication time of each text data and the occurrence time of enterprise events in the plurality of text data included in the source data set to obtain text key time information corresponding to each text data;

and packaging and combining the title keyword list, the text keyword list and the text key time information corresponding to each piece of text data in the source data set to obtain an enterprise history information list corresponding to each piece of text data, so as to obtain processed data corresponding to each piece of source data.

4. The method for storing data according to claim 1, wherein the step of performing deduplication on the processed data, which is one-to-one corresponding to the plurality of text data included in the source data, to obtain a deduplicated data set corresponding to the source data, includes:

acquiring each piece of processed data, and judging whether the processed data corresponding to a plurality of pieces of text data included in the source data set one by one has repeated processed data according to a title keyword list and a text keyword list included in each piece of processed data;

If the processed data corresponding to the text data included in the source data set one by one has repeated processed data, any selected group of the processed data is stored and the processed data which is not selected to be stored is deleted, so that the de-duplicated data set corresponding to the source data is obtained.

5. A data storage device, comprising:

a data display unit, configured to display the development track data through a time axis or a calendar;

the track data acquisition unit includes:

the descending order sorting unit is used for descending order sorting of the enterprise history information lists according to the occurrence time of the enterprise events included in the enterprise information lists to obtain a sorted enterprise history information list set;

the track data mapping unit is used for sequentially adding each enterprise event occurrence time in the ordered enterprise history information list set to a time node corresponding to a preset time axis or calendar according to the descending order of the enterprise event occurrence time so as to obtain development track data corresponding to the enterprise name;

6. The data storage device of claim 5, further comprising:

a retrieval unit, configured to, if an entered retrieval keyword is detected, compare the retrieval keyword with the enterprise history information lists and acquire, as a retrieval push result, each enterprise history information list whose Pearson similarity with the retrieval keyword exceeds a preset similarity threshold.

7. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the data storage method according to any of claims 1 to 4 when executing the computer program.

8. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the data storage method according to any one of claims 1 to 4.