CN117807175A

CN117807175A - Data storage method, device, equipment and medium

Info

Publication number: CN117807175A
Application number: CN202311801438.5A
Authority: CN
Inventors: 魏启鹏; 郝香; 蒋红宇
Original assignee: Beijing Haitai Fangyuan High Technology Co Ltd
Current assignee: Beijing Haitai Fangyuan High Technology Co Ltd
Priority date: 2023-12-26
Filing date: 2023-12-26
Publication date: 2024-04-02

Abstract

The application relates to the technical field of computers, in particular to a data storage method, device, equipment and medium, which are used for constructing association relations among unstructured data so as to improve the query efficiency of the unstructured data. The method comprises the following steps: a first device performs word segmentation processing on the text of each unstructured data according to a natural language processing method, so as to obtain a plurality of words corresponding to each unstructured data. The first device performs vector conversion processing on the plurality of words according to a preset word vector model, to obtain a vector of the unstructured data. After the vectors of the plurality of unstructured data are obtained, the first device determines the clusters corresponding to the unstructured data according to the similarity between the vectors, and stores the unstructured data into the corresponding storage spaces according to those clusters. Different clusters correspond to different topic modes and to different storage spaces.

Description

Data storage method, device, equipment and medium

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a data storage method, apparatus, device, and medium.

Background

In the field of computer technology, data can be divided into structured data and unstructured data. Structured data is highly organized, well-formatted information that can be represented by data or uniform characters, such as numbers and symbols. Unstructured data is any other data: its structure is irregular or incomplete and it has no predefined model. For example, unstructured data includes text files, pictures, audio data, video data, and the like.

In the prior art, a scheme for storing unstructured data typically stores a storage path or an identifier of the unstructured data as a field in a database. A query can then retrieve only the unstructured data indicated by that path or identifier, and cannot obtain other data related to it. Storing unstructured data in this way therefore reduces the query efficiency of the unstructured data.

Disclosure of Invention

The embodiment of the application provides a data storage method, device, equipment and medium, which are used for constructing an association relation between unstructured data so as to improve the query efficiency of the unstructured data.

In a first aspect, the present application provides a data storage method, the method comprising: obtaining a plurality of unstructured data to be stored; performing word segmentation processing on the text in each of the unstructured data according to a natural language processing method, to obtain a plurality of words corresponding to the unstructured data; performing vector conversion processing on the words corresponding to the unstructured data according to a preset word vector model, to obtain vectors corresponding to the unstructured data; determining clusters corresponding to the unstructured data according to the similarity between the vectors corresponding to the unstructured data, wherein different clusters correspond to different storage spaces; and storing the plurality of unstructured data into the storage spaces corresponding to their clusters.

According to the method, the plurality of unstructured data are divided into different clusters according to the similarity of their vectors, and the unstructured data are stored in the corresponding storage spaces according to their clusters. That is, unstructured data with similar vectors can be stored in the same storage space, thereby constructing an association relationship between a plurality of unstructured data. In addition, because different clusters correspond to different topic modes, a relationship between the unstructured data and the topic modes can be constructed, so that a query by topic mode returns a plurality of unstructured data, which improves the query efficiency of the unstructured data.

In one implementation, text in the plurality of unstructured data is preprocessed according to a natural language processing method to obtain the preprocessed plurality of unstructured data. Word segmentation is carried out according to the preprocessed unstructured data to obtain a plurality of words corresponding to the unstructured data.

By the method, the text is subjected to noise removal processing, and the accuracy of word segmentation of the text can be improved. The text is filtered according to the stop words, so that the storage space can be saved and the searching efficiency can be improved.

In one implementation, the plurality of words of the unstructured data are tagged with parts of speech, and semantic categories of the plurality of words are determined. The topic mode of the cluster corresponding to the unstructured data is then determined according to the semantic categories of the words.

In this way, the topic mode of the cluster corresponding to the unstructured data is determined according to the semantic categories of the plurality of words, and the association relationship between the unstructured data is constructed, so that the query efficiency of the unstructured data can be improved.

In one implementation, the data used to indicate the topic mode is structured data.

By the method, the association relation between the unstructured data and the structured data can be constructed, so that the query efficiency of the unstructured data is improved.

In a second aspect, the present application provides a data storage device comprising a communication module and a processing module. The communication module is used for acquiring a plurality of unstructured data to be stored. The processing module is used for performing word segmentation processing on the text in each of the unstructured data according to a natural language processing method, to obtain a plurality of words corresponding to the unstructured data. The processing module is further used for performing vector conversion processing on the words corresponding to the unstructured data according to a preset word vector model, to obtain vectors corresponding to the unstructured data. The processing module is further used for determining clusters corresponding to the unstructured data according to the similarity between the vectors corresponding to the unstructured data, wherein different clusters correspond to different storage spaces. The processing module is further used for storing the plurality of unstructured data into the storage spaces corresponding to their clusters.

In one implementation, the processing module is specifically configured to: preprocess the text in the unstructured data according to the natural language processing method to obtain the preprocessed unstructured data; and perform word segmentation on the preprocessed unstructured data to obtain a plurality of words corresponding to the unstructured data.

In one implementation, the processing module is further configured to: tag the plurality of words of the unstructured data with parts of speech, and determine semantic categories of the plurality of words; and determine the topic mode of the cluster corresponding to the unstructured data according to the semantic categories of the words.

In a third aspect, the present application provides an electronic device, comprising:

a memory for storing program instructions;

a processor for invoking program instructions stored in the memory and executing the steps comprised by the method according to any of the first aspects in accordance with the obtained program instructions.

In a fourth aspect, the present application provides a computer readable storage medium storing a computer program comprising program instructions which, when executed by a computer, cause the computer to perform the method of any one of the first aspects.

In a fifth aspect, the present application provides a computer program product comprising: computer program code which, when run on a computer, causes the computer to perform the method of any of the first aspects.

For the technical effects of the second aspect to the fifth aspect and any one of the designs thereof, refer to the technical effects of the corresponding designs in the first aspect; they are not repeated here.

Drawings

FIG. 1 is a flow chart of a data storage method according to an embodiment of the present application;

FIG. 2 is a schematic structural diagram of a data storage system according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a semantic network structure according to an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a data storage device according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

For the purposes of making the objects, technical solutions and advantages of the present application more apparent, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure. Embodiments and features of embodiments in this application may be combined with each other arbitrarily without conflict. Also, while a logical order of illustration is depicted in the flowchart, in some cases the steps shown or described may be performed in a different order than presented.

The terms first and second in the description and claims of the present application and in the above-described figures are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the term "include" and any variations thereof are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus. The term "plurality" in the present application may mean at least two, for example, two, three or more, and embodiments of the present application are not limited in this respect.

In this technical solution, the collection, transmission, and use of data all comply with the relevant national laws and regulations.

Before describing the data storage method provided in the embodiments of the present application, for ease of understanding, the technical background of the embodiments of the present application is first described in detail.

In the prior art unstructured data storage schemes, unstructured data is typically stored in a specified file system in the form of a file, and a storage path or identifier of the unstructured data is stored as a field in a database, so as to facilitate querying of the unstructured data. The path information or identifier stored in the database generally only indicates the corresponding unstructured data, and cannot reveal the relationship between the data and other data.

In practical applications, a query for unstructured data generally also needs the data associated with it. However, in the prior art, a query can only retrieve the unstructured data indicated by the stored path information or identifier, and cannot obtain the data related to it. Storing unstructured data in this way therefore reduces the query efficiency of the unstructured data.

In order to solve the above-mentioned drawbacks, embodiments of the present application provide a data storage method, apparatus, device, and medium, which are used to construct an association relationship between unstructured data, so as to improve the query efficiency of unstructured data.

In the present application, the first device may perform the following processing on each of the acquired plurality of unstructured data: the first device may perform word segmentation processing on the text of the unstructured data according to a natural language processing method, so as to obtain a plurality of words corresponding to the unstructured data. The first device may perform vector conversion processing on the plurality of words according to a preset word vector model, to obtain a vector of the unstructured data. After the vectors of the plurality of unstructured data are obtained, the first device determines the clusters corresponding to the unstructured data according to the similarity between the vectors, and stores the unstructured data into the corresponding storage spaces according to those clusters. Different clusters correspond to different topic modes and to different storage spaces.

It can be appreciated that the first device may divide the plurality of unstructured data into different clusters according to the similarity of vectors of the plurality of unstructured data, and store the unstructured data into the corresponding storage space according to the clusters corresponding to the unstructured data. That is, the first device may store unstructured data with similar vectors in the same storage space, thereby constructing an association relationship between a plurality of unstructured data. In addition, as different clusters correspond to different topic modes, the relation between a plurality of unstructured data and the topic modes can be constructed, and the plurality of unstructured data can be obtained when the query is carried out according to the topic modes, so that the query efficiency of the unstructured data is improved.

Furthermore, the first device may be a processing device for performing the method shown in the present application, or the first device may be a processing apparatus, such as a processor or a processing module, in a computer system for performing the method shown in the present application, which is not particularly limited in the present application.

Fig. 1 is a flow chart of a data storage method according to an embodiment of the present invention. Taking the first device as an execution body as an example, the process may include the following steps:

S101: the first device obtains a plurality of unstructured data to be stored.

In particular, unstructured data in the present application may be unstructured data including text; for example, unstructured data may include text documents, mail, office documents, and the like. The user may send unstructured data to be stored to the first device via an input device, and correspondingly, the first device receives the unstructured data to be stored. Alternatively, the first device may obtain unstructured data to be stored from its own stored data. For example, the first device may include a cache module from which the first device may obtain unstructured data to be stored. The data in the cache module may be pre-stored data. In addition, the first device may obtain unstructured data to be stored in other manners, which is not specifically limited in this application.

Fig. 2 is a schematic structural diagram of a data storage system according to an embodiment of the present application. As shown in fig. 2, after acquiring the data to be stored, the first device may determine that the type of the data is structured data or unstructured data.

S102: the first device performs word segmentation processing on texts in the unstructured data according to a natural language processing method to obtain a plurality of words corresponding to the unstructured data.

Specifically, natural language processing can realize effective communication between a person and a computer by using natural language, and can be used for performing viewpoint extraction, text classification, text semantic comparison, word segmentation and the like on texts.

After obtaining the plurality of unstructured data to be stored, the first device may perform the following operation on text in each of the plurality of unstructured data to be stored:

In one or more embodiments, the first device may preprocess the text in the unstructured data according to a natural language processing method to obtain the preprocessed unstructured data. The preprocessing includes removing noise and filtering stop words.

The first device may perform word segmentation according to the pre-processed unstructured data, to obtain a plurality of words corresponding to the unstructured data.

Specifically, since the text in the unstructured data may have the problems of grammar errors, wrongly written characters or punctuation errors and the like, before word segmentation processing is performed on the text in the unstructured data, the first device can perform noise removal processing on the text in the unstructured data according to a natural language processing method, so that the problems of grammar errors, wrongly written characters or punctuation errors and the like in the text are solved. It can be appreciated that the noise removal processing is performed on the text, so that the accuracy of the subsequent word segmentation processing on the text can be improved.

The first device may process the text according to preset stop words, and filter the stop words in the text. The stop words may be words (such as adjectives, adverbs or connective words) in unstructured data, which are not related to the subject, or words set by the user according to the needs of the user. For example, the user may pre-configure a stop word list including at least one stop word, and when the first device performs preprocessing on the text, the words included in the stop word list in the text may be deleted according to the stop word list. It can be understood that the text is filtered according to the stop words, so that the storage space can be saved and the searching efficiency can be improved.
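The preprocessing described above can be sketched as follows. This is a minimal illustration only: the stop-word list, the function names, and the choice of treating repeated punctuation and extra whitespace as "noise" are assumptions made for the example and are not specified by the application.

```python
import re

# Hypothetical stop-word list; the application leaves its contents to the user.
STOP_WORDS = {"the", "a", "an", "and", "very"}

def remove_noise(text: str) -> str:
    """Collapse repeated punctuation and whitespace (a simple stand-in for noise removal)."""
    text = re.sub(r"([!?.,;:])\1+", r"\1", text)  # "!!!" becomes "!"
    text = re.sub(r"\s+", " ", text)              # runs of whitespace become one space
    return text.strip()

def filter_stop_words(text: str, stop_words=STOP_WORDS) -> str:
    """Drop space-delimited tokens whose lowercase form is in the stop-word list."""
    kept = [tok for tok in text.split(" ") if tok.lower() not in stop_words]
    return " ".join(kept)

def preprocess(text: str) -> str:
    """Noise removal followed by stop-word filtering."""
    return filter_stop_words(remove_noise(text))
```

For example, `preprocess("The  cat and   the dog!!!")` yields `"cat dog!"`.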

The word segmentation processing method for the preprocessed text can comprise a dictionary-based word segmentation method and a statistical-based word segmentation method.

The dictionary-based word segmentation method may be to split a text to be segmented into a plurality of parts and compare each part with a pre-created dictionary. If the word is in the dictionary, the word segmentation is successful; if the word is not in the dictionary, the word is further split, and then the split word is compared with the dictionary.
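One well-known dictionary-based strategy that matches this description is forward maximum matching: take the longest candidate first and shrink it until it is found in the dictionary. The sketch below assumes that strategy; the function name, the `max_len` parameter, and the sample dictionary are hypothetical.

```python
def segment_by_dictionary(text, dictionary, max_len=11):
    """Forward maximum matching over an unspaced string."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then split it further by shrinking it.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the loop makes progress.
            if candidate in dictionary or size == 1:
                words.append(candidate)
                i += size
                break
    return words
```

With the dictionary `{"data", "storage", "method", "datastorage"}`, the text `"datastoragemethod"` is split into `["datastorage", "method"]`, and an unknown character falls through as a one-character word.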

The statistics-based word segmentation method may, based on a pre-configured corpus, count how often adjacent characters occur together, and from these counts estimate the probability that the adjacent characters form a word. That is, the probability of a word is determined from the number of times its adjacent characters co-occur. The text is then segmented according to these word probabilities.
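A minimal sketch of a statistics-based segmenter is given below. It assumes one simple reading of the description: estimate from the corpus how often one character is followed by another, and cut the text wherever that estimate falls below a threshold. The threshold value, the function names, and the toy corpus are assumptions, not details from the application.

```python
from collections import Counter

def train_bigram_model(corpus):
    """Count single characters and adjacent character pairs in the corpus."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    return unigrams, bigrams

def segment_by_statistics(text, unigrams, bigrams, threshold=0.5):
    """Keep adjacent characters together when they co-occur often enough."""
    if not text:
        return []
    words, current = [], text[0]
    for prev, nxt in zip(text, text[1:]):
        # Estimated probability that `nxt` follows `prev` in the corpus.
        prob = bigrams[(prev, nxt)] / unigrams[prev] if unigrams[prev] else 0.0
        if prob >= threshold:
            current += nxt
        else:
            words.append(current)
            current = nxt
    words.append(current)
    return words
```

With the toy corpus `["abcd", "abxy", "cdab"]`, the pair `a, b` always co-occurs while `b, c` rarely does, so `"abcd"` is cut into `["ab", "cd"]`.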

In addition, other methods can be used for word segmentation processing on the text, and the application is not particularly limited.

Optionally, after the text is segmented, the first device may label parts of speech for a plurality of words of the unstructured data, and determine semantic categories of the plurality of words.

Based on step S102, the first device performs word segmentation processing on the text, so that the accuracy and conversion efficiency of subsequent vector conversion on the text can be improved.

S103: the first device performs vector conversion processing on words corresponding to the unstructured data according to a preset word vector model to obtain vectors corresponding to the unstructured data.

Step S103 will be described below by taking any unstructured data among a plurality of unstructured data as an example:

Specifically, after the first device performs word segmentation on the text in the unstructured data, the first device may extract the feature of each word according to a preset word vector model, and represent the feature of each word as a vector, so as to obtain a feature vector of each word. For example, the feature of a word may be determined according to the frequency of occurrence of the word in the text and the importance of the word in the text, and that feature is then represented in the form of a vector to obtain the feature vector of the word.

The first device may splice the feature vectors of the words corresponding to the unstructured data to form a vector including all word features, i.e., the vector of the unstructured data. For example, the vector of the unstructured data may be represented as $X$, and the vectors of the n words may be represented as $x_1, x_2, x_3, \ldots, x_n$, where n is a positive integer. The vector of the unstructured data and the vectors of the n words satisfy:

$$X = (x_1, x_2, x_3, \ldots, x_n)$$
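As an illustration of step S103, the sketch below builds a document vector from word frequency and word importance using TF-IDF weights. TF-IDF is only one possible reading of "frequency and importance"; the application does not name it. The sketch also assumes a fixed, shared vocabulary so that every document vector has the same length n, and all function names are hypothetical.

```python
import math
from collections import Counter

def build_vocabulary(docs):
    """Sorted list of all distinct words across the segmented documents."""
    return sorted({w for words in docs for w in words})

def tfidf_vector(words, docs, vocabulary):
    """One component per vocabulary word: term frequency times smoothed IDF."""
    counts = Counter(words)
    n_docs = len(docs)
    vector = []
    for term in vocabulary:
        tf = counts[term] / len(words) if words else 0.0   # frequency in this text
        df = sum(1 for d in docs if term in d)             # documents containing the term
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0      # importance (smoothed IDF)
        vector.append(tf * idf)
    return vector
```

For two documents `["data", "storage"]` and `["data", "query"]`, the word shared by both documents receives a lower weight than the word unique to one of them.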

S104: the first device determines clusters corresponding to the unstructured data according to the similarity between vectors corresponding to the unstructured data. Different clusters correspond to different topic modes.

Specifically, the first device may calculate the similarity between the vectors of the unstructured data, and assign unstructured data whose similarity is greater than a set threshold to the same cluster (a cluster may also be referred to as a group). The threshold may be any value set by the user as needed; for example, the threshold may be 0.9.

Illustratively, the plurality of unstructured data includes first unstructured data and second unstructured data, wherein the first unstructured data may be represented as A and the second unstructured data as B. The vector of the first unstructured data may be represented as $X = (x_1, x_2, x_3, \ldots, x_n)$, and the vector of the second unstructured data as $Y = (y_1, y_2, y_3, \ldots, y_n)$. The similarity may be expressed as Similar(A, B) and is computed from X and Y according to a similarity formula (the formula itself is not reproduced in this text).

Wherein Similar(A, B) represents the similarity of A and B.
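The similarity formula is not reproduced in this text. The sketch below assumes cosine similarity, a common choice whose values for non-negative vectors lie in [0, 1] and are therefore compatible with a threshold such as 0.9; this is an assumption, not the formula of the application. The grouping rule (compare each vector with the first member of each existing cluster) and the function names are likewise hypothetical.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)

def cluster_by_threshold(vectors, threshold=0.9):
    """Group vector indices; a vector joins every cluster whose first member
    is more similar to it than the threshold, so it may join several clusters."""
    clusters = []
    for i, vec in enumerate(vectors):
        joined = False
        for members in clusters:
            if cosine_similarity(vec, vectors[members[0]]) > threshold:
                members.append(i)
                joined = True
        if not joined:
            clusters.append([i])  # start a new cluster with this vector
    return clusters
```

For the vectors `[1, 0]`, `[0.99, 0.1]` and `[0, 1]`, the first two fall into one cluster and the third forms its own.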

It is understood that one unstructured data may correspond to at least one cluster, with different clusters corresponding to different topic modes. That is, one unstructured data may correspond to at least one topic mode. For example, the topic mode corresponding to the first cluster is author A, and the topic mode corresponding to the second cluster is domain B. If the author of the first unstructured data is A and its domain is B, the first unstructured data belongs to both the first cluster and the second cluster.

For example, FIG. 3 is a schematic diagram of a semantic network structure according to an embodiment of the present application. As shown in FIG. 3, A1 and A2 may represent two different topic modes, and text 1, text 2, text 3, text 4, and text 5 may represent five different unstructured data. A1 has an association relationship with text 1, text 2, and text 3, and A2 has an association relationship with text 3, text 4, and text 5. That is, text 3 corresponds to two different topic modes, A1 and A2. For example, A1 may represent an author, indicating that the author of text 1, text 2, and text 3 is A1. A2 may represent a field, indicating that text 3, text 4, and text 5 all relate to field A2.
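The semantic network of this example can be represented as a mapping from topic modes to texts. The sketch below uses the A1/A2 associations stated above; the data structure and function names are illustrative assumptions.

```python
from collections import defaultdict

# Associations from the example: topic mode -> associated texts.
TOPIC_TO_TEXTS = {
    "A1": ["text 1", "text 2", "text 3"],
    "A2": ["text 3", "text 4", "text 5"],
}

def invert(topic_to_texts):
    """Build the reverse mapping: text -> topic modes it belongs to."""
    text_to_topics = defaultdict(list)
    for topic, texts in topic_to_texts.items():
        for text in texts:
            text_to_topics[text].append(topic)
    return dict(text_to_topics)

def related_texts(text, topic_to_texts):
    """Other texts that share at least one topic mode with `text`."""
    related = []
    for texts in topic_to_texts.values():
        if text not in texts:
            continue
        for other in texts:
            if other != text and other not in related:
                related.append(other)
    return related
```

Querying by text 3 returns the other four texts, because text 3 belongs to both topic modes, whereas querying by text 1 returns only text 2 and text 3.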

Optionally, the first device may determine the topic mode corresponding to the cluster according to semantic categories of a plurality of words of unstructured data in the same cluster.

The first device may construct an association relationship among the plurality of unstructured data that correspond to the same topic mode.
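The application does not specify how semantic categories are combined into a topic mode. One simple possibility, assumed here for illustration, is a majority vote over the semantic categories of all words in the cluster; the function name and the input layout are hypothetical.

```python
from collections import Counter

def topic_mode_for_cluster(cluster_docs):
    """cluster_docs: list of documents, each a list of (word, semantic_category).

    Returns the most frequent semantic category, or None for an empty cluster.
    """
    counts = Counter(category for doc in cluster_docs for _, category in doc)
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

For a cluster in which the category "author" appears more often than any other, the sketch returns "author" as the topic mode.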

Based on step S104, the first device divides the plurality of unstructured data into different groups according to the similarity between vectors of the unstructured data, and constructs an association relationship between the unstructured data, so that the query efficiency of the unstructured data can be improved.

Optionally, the data used to indicate the topic mode in the present application may be structured data.

Specifically, the data for indicating the topic mode may be text corresponding to the topic mode, or may be a unique identifier of the topic mode. The first device may store the data indicating the topic mode in a structured database. Alternatively, the first device may use data already in the structured database as a topic mode or to indicate a topic mode.

Based on the embodiment, the association relation between unstructured data and structured data can be constructed, so that the data query efficiency is improved.

S105: the first device stores the plurality of unstructured data into the corresponding storage spaces according to the clusters corresponding to the unstructured data.

Specifically, the first device may divide the unstructured database into a plurality of storage spaces according to different clusters or different topic modes. The first device may store a part or all of the content of the unstructured data to the corresponding storage space according to a cluster or a theme mode corresponding to the unstructured data.

For example, as shown in FIG. 2, the first device may further extract keywords and/or metadata from the unstructured data and configure a unique identifier for the unstructured data before storing it. The first device may use the keywords and/or metadata of the unstructured data as the identifier of the data, or may use the keywords and/or metadata as part of the content of the identifier.

The first device may store an identifier of unstructured data to a storage space corresponding to a subject pattern of the data and store the unstructured data to a separate file system. It can be appreciated that by storing unstructured data in this way, the unstructured data with relevance can be stored in the same storage space, and meanwhile, the storage space occupied by the unstructured data can be reduced.
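The storage layout described above can be sketched as follows: the data itself goes to a separate file system, and only its identifier is recorded in the storage space of each topic mode. The temporary directory standing in for the file system, the dictionary standing in for the storage spaces, and the function name `store` are all assumptions for illustration.

```python
import tempfile
import uuid
from pathlib import Path

# Hypothetical stand-in for the separate file system.
FILE_ROOT = Path(tempfile.mkdtemp())

def store(doc_text, topic_modes, storage_spaces, file_root=FILE_ROOT):
    """Write the data to the file system and record its identifier per topic mode."""
    doc_id = uuid.uuid4().hex                        # unique identifier for the data
    path = Path(file_root) / f"{doc_id}.txt"
    path.write_text(doc_text, encoding="utf-8")      # the data itself
    for topic in topic_modes:
        storage_spaces.setdefault(topic, []).append(doc_id)  # identifier only
    return doc_id
```

Because only identifiers are kept per topic mode, data belonging to several topic modes is written once and referenced several times.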

The method provided in this application is described in detail below by way of Embodiment 1. In Embodiment 1, the structured data and the unstructured data are stored in different storage modes, and information about the stored data is uploaded to the blockchain, so that the integrity and consistency of the data are ensured.

The details of Embodiment 1 are as follows:

As shown in FIG. 2, after acquiring data to be stored, the first device may determine the data type of the data. The data types include structured data and unstructured data.

If the data type of the data to be stored is unstructured data, the first device stores the data into an independent file system, extracts keywords and metadata, classifies or clusters the text and the like, and stores the keywords, the metadata and the like of the data into a database as structured data.

In addition, the first device may store a portion of the content of the unstructured data into a search engine and use the search engine to construct an index structure for the stored unstructured data. For example, the search engine may define a field mapping relationship of unstructured data, may split text of the unstructured data into a plurality of terms according to an analyzer, and may set a filter to filter some of the terms in the unstructured data. It will be appreciated that integrating part of the content of unstructured data into a search engine in this manner can reduce the complexity of query and processing of unstructured data.
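The analyzer, filter, and index structure mentioned above can be illustrated with a small in-memory inverted index. This is not the API of any particular search engine; the filter list and all function names are hypothetical.

```python
import re
from collections import defaultdict

FILTERED_TERMS = {"the", "of", "and"}  # hypothetical filter list

def analyze(text):
    """Analyzer: lowercase, split into terms, then apply the term filter."""
    terms = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in terms if t not in FILTERED_TERMS]

def build_index(docs):
    """docs: {doc_id: text}. Returns an inverted index {term: set of doc_ids}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every term of the query."""
    terms = analyze(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

A query is run through the same analyzer as the documents, so filtered terms never reach the index and a query consisting only of filtered terms matches nothing.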

The first device may construct an association relationship between unstructured data. For how the first device constructs this association relationship, refer to steps S102 to S104; details are not repeated here.

If the data type of the data to be stored is structured data, the first device may add a data source to the data and determine a corresponding model and/or table structure according to the content of the data. The first device stores the data in a database according to the determined model and/or table structure.

The first device may configure the structured data in the database with a unique identification code and/or index, such as a number, bar code, two-dimensional code, or the like. Wherein the structured data comprises keywords of unstructured data, metadata and other contents.
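A minimal sketch of the structured side is shown below, using an in-memory SQLite database. The table name, column names, and the use of a random UUID as the unique identification code are assumptions for illustration; the application does not prescribe a schema.

```python
import sqlite3
import uuid

def open_db():
    """Create an in-memory database with a hypothetical metadata table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE doc_meta ("
        "id TEXT PRIMARY KEY, source TEXT NOT NULL, keywords TEXT, topic_mode TEXT)"
    )
    conn.execute("CREATE INDEX idx_topic ON doc_meta (topic_mode)")
    return conn

def insert_meta(conn, source, keywords, topic_mode):
    """Store keywords and metadata as structured data under a unique identifier."""
    record_id = uuid.uuid4().hex
    conn.execute(
        "INSERT INTO doc_meta VALUES (?, ?, ?, ?)",
        (record_id, source, ",".join(keywords), topic_mode),
    )
    conn.commit()
    return record_id

def ids_for_topic(conn, topic_mode):
    """Identifiers of all records with the given topic mode, in insertion order."""
    rows = conn.execute(
        "SELECT id FROM doc_meta WHERE topic_mode = ? ORDER BY rowid", (topic_mode,)
    )
    return [row[0] for row in rows]
```

Looking up a topic mode then returns the identifiers of every record stored under it, which can be resolved to the unstructured data in the file system.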

In addition, when the first device stores data, the first device may trigger generation of data change information and send the data change information to the message queue. Wherein, a trigger or a monitor for monitoring the event of the data change can be arranged in the database, and when the trigger or the monitor monitors the event of the data change, the trigger or the monitor can send the event of the data change to the message queue. Message queues can provide a reliable information delivery and processing mechanism, ensuring reliability and consistency of data synchronization. In this way, the complexity of data management and maintenance can be reduced.
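The trigger and message queue interaction can be sketched with Python's standard `queue` module. The `Database` class, the event fields, and the `drain` helper are hypothetical stand-ins; a production system would use a real database trigger and a message broker.

```python
import queue

class Database:
    """Toy key-value store whose writes publish change events (the 'trigger')."""

    def __init__(self, change_queue):
        self._rows = {}
        self._queue = change_queue

    def put(self, key, value):
        operation = "update" if key in self._rows else "insert"
        self._rows[key] = value
        # Trigger: send the data change event to the message queue.
        self._queue.put({"op": operation, "key": key, "value": value})

def drain(change_queue):
    """Consume all pending change events in the order they were published."""
    events = []
    while True:
        try:
            events.append(change_queue.get_nowait())
        except queue.Empty:
            return events
```

A consumer that drains the queue sees an "insert" event followed by an "update" event when the same key is written twice, and can replay them to keep a downstream index in sync.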

The first device may import the corresponding structured data and unstructured data into the search engine according to the information in the message queue, and upload the related information of the structured data and unstructured data to the blockchain for evidence storage.

The following describes how the related information of structured data and unstructured data is uploaded to the blockchain for evidence storage:

The first device may upload information such as a trusted timestamp, an index, metadata, and data change operations of the structured data and the unstructured data to the blockchain, compute a digest of the information using a hash algorithm, and store the hash value corresponding to the information in a contract of the blockchain. It can be appreciated that uploading the related information of the structured data and the unstructured data to the blockchain can achieve transparency and traceability of the data, thereby improving the security of the stored data.
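The digest step can be illustrated as follows. The application only says "a hash algorithm"; SHA-256 is assumed here. `ToyLedger` is an in-memory stand-in for contract storage and is not a real blockchain client; the record fields are hypothetical.

```python
import hashlib
import json

def digest_record(record: dict) -> str:
    """Serialize the record deterministically and return its SHA-256 hex digest."""
    payload = json.dumps(record, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

class ToyLedger:
    """In-memory stand-in for the hash values kept in a blockchain contract."""

    def __init__(self):
        self._hashes = {}

    def attest(self, data_id, record):
        """Store the hash value of the record under the data identifier."""
        self._hashes[data_id] = digest_record(record)

    def verify(self, data_id, record):
        """True if the record still matches the stored hash value."""
        return self._hashes.get(data_id) == digest_record(record)
```

Any later change to the timestamp, index, metadata, or operation produces a different digest, so `verify` fails for a tampered record.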

In the present application, an in-memory database and a distributed cache can be used as a cache layer in front of the database. Frequently accessed data is kept in memory, which improves data reading performance, reduces the load on the database, and improves data query efficiency.
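A cache layer of this kind can be sketched as a read-through cache with least-recently-used eviction. The eviction policy, the capacity, and the class name are assumptions; the application does not specify them.

```python
from collections import OrderedDict

class ReadThroughCache:
    """Serve reads from memory and fall back to the database on a miss."""

    def __init__(self, load_from_database, capacity=128):
        self._load = load_from_database      # callable: key -> value
        self._capacity = capacity
        self._entries = OrderedDict()

    def get(self, key):
        if key in self._entries:
            self._entries.move_to_end(key)   # mark as most recently used
            return self._entries[key]
        value = self._load(key)              # cache miss: query the database
        self._entries[key] = value
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return value
```

Repeated reads of the same key reach the database only once, until the entry is evicted to make room for more recently used keys.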

Based on the same inventive concept, embodiments of the present application provide a data storage device. Fig. 4 shows a schematic structural diagram of a data storage device according to an embodiment of the present application. As shown in fig. 4, the apparatus includes a communication module and a processing module.

The communication module 401 is configured to obtain a plurality of unstructured data to be stored. The processing module 402 is configured to perform word segmentation processing on the text in each unstructured data according to a natural language processing method, so as to obtain a plurality of words corresponding to each unstructured data. The processing module 402 is further configured to perform vector conversion processing on the words corresponding to the plurality of unstructured data according to a preset word vector model, so as to obtain vectors corresponding to the plurality of unstructured data. The processing module 402 is further configured to determine the clusters corresponding to the plurality of unstructured data according to the similarities between the vectors corresponding to the plurality of unstructured data, where different clusters correspond to different topic modes. The processing module 402 is further configured to store each of the plurality of unstructured data in the storage space corresponding to its cluster.
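The processing steps listed above (word segmentation, vector conversion, similarity-based clustering, and storage by cluster) can be sketched end to end as follows. Everything specific here is an assumption for illustration: the two-dimensional `WORD_VECTORS` table stands in for the preset word vector model, whitespace splitting stands in for a real word segmenter, averaging word vectors is one simple way to obtain a vector per data item, and greedy threshold clustering on cosine similarity is one simple clustering choice, not necessarily the one used in the application.

```python
import math
from collections import defaultdict

# Hypothetical "preset word vector model": a tiny hand-written lookup table.
WORD_VECTORS = {
    "invoice": (1.0, 0.0),
    "payment": (0.9, 0.1),
    "tax": (0.8, 0.2),
    "football": (0.0, 1.0),
    "match": (0.1, 0.9),
    "goal": (0.2, 0.8),
}


def segment(text):
    """Word segmentation by whitespace (a real system would use an NLP segmenter)."""
    return text.lower().split()


def document_vector(words):
    """Average the vectors of known words to get one vector per data item."""
    known = [WORD_VECTORS[w] for w in words if w in WORD_VECTORS]
    if not known:
        return (0.0, 0.0)
    return tuple(sum(axis) / len(known) for axis in zip(*known))


def cosine(a, b):
    norm = math.hypot(*a) * math.hypot(*b)
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def assign_clusters(vectors, threshold=0.8):
    """Greedy clustering: join the first cluster whose representative is similar enough."""
    representatives, labels = [], []
    for vec in vectors:
        for cluster_id, rep in enumerate(representatives):
            if cosine(vec, rep) >= threshold:
                labels.append(cluster_id)
                break
        else:
            representatives.append(vec)
            labels.append(len(representatives) - 1)
    return labels


documents = ["Invoice payment tax", "Football match goal", "Tax invoice", "Goal football"]
vectors = [document_vector(segment(d)) for d in documents]
labels = assign_clusters(vectors)

# One storage space per cluster, modeled here as one list per cluster id.
storage = defaultdict(list)
for doc, label in zip(documents, labels):
    storage[label].append(doc)
```

With these toy vectors, the two finance-related texts end up in one storage space and the two sports-related texts in another, so a query on one topic only needs to look in one storage space.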

In one implementation, the processing module 402 is specifically configured to: preprocess the text in the unstructured data according to the natural language processing method to obtain preprocessed unstructured data, where the preprocessing includes noise removal and stop-word removal; and perform word segmentation on the preprocessed unstructured data to obtain the plurality of words corresponding to the unstructured data.
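The preprocessing described above can be sketched as follows. The definition of "noise" (markup tags and punctuation), the `STOP_WORDS` set, and the function names are assumptions for illustration. The sketch filters stop words after splitting on whitespace because that is the simplest approach for space-delimited text; text in a language without word delimiters would need a dedicated segmenter.

```python
import re

# Hypothetical stop-word list; a real system would use a fuller, language-specific list.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}


def remove_noise(text):
    """Strip markup tags and punctuation, collapse whitespace, and lowercase."""
    text = re.sub(r"<[^>]+>", " ", text)   # drop HTML-like tags
    text = re.sub(r"[^\w\s]", " ", text)   # drop punctuation and symbols
    return re.sub(r"\s+", " ", text).strip().lower()


def preprocess_and_segment(text):
    """Clean the text, split it into words, and discard stop words."""
    cleaned = remove_noise(text)
    return [word for word in cleaned.split() if word not in STOP_WORDS]


words = preprocess_and_segment("<p>The storage of  unstructured data, and the query!</p>")
```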

In one implementation, the processing module 402 is further configured to: perform part-of-speech tagging on the plurality of words of the unstructured data and determine the semantic categories of the plurality of words; and determine the topic mode of the cluster corresponding to the unstructured data according to the semantic categories of the plurality of words.
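One way to picture this step is the sketch below. The `LEXICON` table, which maps each word to a part of speech and a semantic category, is hypothetical and stands in for a real part-of-speech tagger and semantic resource. Choosing the topic by majority vote over nouns and verbs is likewise an assumption; the application does not state how semantic categories are combined into a topic mode.

```python
from collections import Counter

# Hypothetical lexicon: word -> (part of speech, semantic category).
LEXICON = {
    "invoice": ("noun", "finance"),
    "payment": ("noun", "finance"),
    "pay": ("verb", "finance"),
    "goal": ("noun", "sports"),
    "match": ("noun", "sports"),
    "quickly": ("adverb", None),
}


def tag_words(words):
    """Return (word, part_of_speech, semantic_category) for each word."""
    return [(w, *LEXICON.get(w, ("unknown", None))) for w in words]


def cluster_topic(cluster_words):
    """Pick the most common semantic category among nouns and verbs (assumed rule)."""
    counts = Counter(
        category
        for _, pos, category in tag_words(cluster_words)
        if pos in {"noun", "verb"} and category is not None
    )
    return counts.most_common(1)[0][0] if counts else None


topic = cluster_topic(["invoice", "payment", "pay", "goal", "quickly"])
```

The resulting label is a short, fixed-format value, which fits the statement in the claims that the data indicating the topic mode is structured data.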

Based on the same inventive concept, embodiments of the present application provide an electronic device that can implement the functions of the apparatus discussed above. Fig. 5 shows a schematic structural diagram of an electronic device according to an embodiment of the present application.

The electronic device in embodiments of the present application may include a processor 501. The processor 501 is the control center of the device and may connect the various parts of the device using various interfaces and lines, by running or executing instructions stored in the memory 503 and invoking data stored in the memory 503. Optionally, the processor 501 may include one or more processing units, and may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, application programs, and the like, and the modem processor mainly handles wireless communication. It will be appreciated that the modem processor may also not be integrated into the processor 501. In some embodiments, the processor 501 and the memory 503 may be implemented on the same chip; in other embodiments, they may be implemented on separate chips.

The processor 501 may be a general purpose processor such as a central processing unit (Central Processing Unit, CPU), digital signal processor, application specific integrated circuit, field programmable gate array or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, which may implement or perform the methods, steps and logic blocks disclosed in embodiments of the present application. The general purpose processor may be a microprocessor or any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present application may be performed directly by a hardware processor or by a combination of hardware and software modules in a processor.

In the embodiments of the present application, the memory 503 stores instructions executable by the at least one processor 501, and the at least one processor 501 may be configured to perform the method steps disclosed in the embodiments of the present application by executing the instructions stored in the memory 503.

The memory 503 is a non-volatile computer-readable storage medium that can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory 503 may include at least one type of storage medium, for example, flash memory, hard disk, multimedia card, card memory, random access memory (Random Access Memory, RAM), static random access memory (Static Random Access Memory, SRAM), programmable read-only memory (Programmable Read Only Memory, PROM), read-only memory (Read Only Memory, ROM), electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), magnetic memory, magnetic disk, optical disk, and the like. The memory 503 may also be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 503 in the embodiments of the present application may also be circuitry or any other device capable of implementing a storage function, for storing program instructions and/or data.

In the embodiment of the application, the apparatus may further include a communication interface 502, and the electronic device may transmit data through the communication interface 502.

Optionally, the processing module 402 and/or the communication module 401 shown in fig. 4 may be implemented by the processor 501 (or by the processor 501 and the communication interface 502) shown in fig. 5; that is, the actions of the processing module 402 and/or the communication module 401 may be performed by the processor 501 (or by the processor 501 and the communication interface 502).

Based on the same inventive concept, embodiments of the present application also provide a computer-readable storage medium storing instructions which, when run on a computer, cause the computer to perform the operational steps provided by the above method embodiments. The computer-readable storage medium may be the memory 503 shown in fig. 5.

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims

1. A method of data storage, the method comprising:

obtaining a plurality of unstructured data to be stored;

word segmentation processing is respectively carried out on texts in the unstructured data according to a natural language processing method, so that a plurality of words corresponding to the unstructured data are obtained;

respectively carrying out vector conversion processing on words corresponding to the unstructured data according to a preset word vector model to obtain vectors corresponding to the unstructured data;

determining clusters corresponding to the unstructured data according to the similarity between vectors corresponding to the unstructured data, wherein different clusters correspond to different storage spaces;

and storing the unstructured data into a storage space corresponding to the cluster according to the cluster corresponding to the unstructured data.

2. The method of claim 1, wherein the performing word segmentation on the text in the unstructured data according to the natural language processing method to obtain a plurality of words corresponding to the unstructured data includes:

preprocessing texts in the unstructured data according to the natural language processing method to obtain preprocessed unstructured data;

and performing word segmentation processing according to the preprocessed unstructured data to obtain words corresponding to the unstructured data.

3. The method of claim 1, wherein the method further comprises:

part-of-speech tagging is carried out on a plurality of words of the unstructured data, and semantic categories of the words are determined;

and determining the topic mode of the corresponding cluster of the unstructured data according to the semantic category of the plurality of words.

4. The method of claim 1 or 3, wherein the data indicating the topic mode is structured data.

5. A data storage device, the device comprising:

the communication module is used for acquiring a plurality of unstructured data to be stored;

the processing module is used for respectively carrying out word segmentation processing on texts in the unstructured data according to a natural language processing method to obtain a plurality of words corresponding to the unstructured data;

the processing module is further used for respectively carrying out vector conversion processing on words corresponding to the unstructured data according to a preset word vector model to obtain vectors corresponding to the unstructured data;

the processing module is further configured to determine clusters corresponding to the plurality of unstructured data according to similarities between vectors corresponding to the plurality of unstructured data, where different clusters correspond to different storage spaces;

the processing module is further configured to store the plurality of unstructured data to a storage space corresponding to the cluster according to the cluster corresponding to the unstructured data.

6. The apparatus of claim 5, wherein the processing module is specifically configured to:

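preprocessing texts in the unstructured data according to the natural language processing method to obtain the preprocessed unstructured data;

and performing word segmentation processing according to the preprocessed unstructured data to obtain the plurality of words corresponding to the unstructured data.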

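7. The apparatus of claim 5, wherein the processing module is further configured to:

perform part-of-speech tagging on a plurality of words of the unstructured data, and determine semantic categories of the plurality of words;

and determine the topic mode of the cluster corresponding to the unstructured data according to the semantic categories of the plurality of words.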

8. The apparatus of claim 5 or 7, wherein the data indicating the topic mode is structured data.

9. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program comprising program instructions which, when executed by a computer, cause the computer to perform the method of any of claims 1-4.

10. A computer program product, the computer program product comprising: computer program code which, when run on a computer, causes the computer to perform the method of any of the preceding claims 1-4.