CN117390013A - Data storage method, retrieval method, system, device and storage medium - Google Patents

Data storage method, retrieval method, system, device and storage medium

Info

Publication number
CN117390013A
Authority
CN
China
Prior art keywords
data set
dimensional vector
low
dimensional
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311182707.4A
Other languages
Chinese (zh)
Inventor
郭玮
苏力强
廖定柏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bohan Intelligent Shenzhen Co ltd
Original Assignee
Bohan Intelligent Shenzhen Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bohan Intelligent Shenzhen Co ltd
Priority to CN202311182707.4A
Publication of CN117390013A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • G06F16/2264Multidimensional index structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • G06F16/24534Query rewriting; Transformation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24553Query execution of query operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/258Data format conversion from or to a database

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the application provide a data storage method, a retrieval method, a system, a device and a storage medium, and belong to the technical field of data processing. The method comprises the following steps: acquiring a target data set to be stored, and extracting the data type of the target data set, wherein the target data set comprises original data features; according to the data type, performing high-dimensional conversion and low-dimensional conversion on the target data set to obtain a high-dimensional vector data set comprising high-dimensional vector features and a low-dimensional vector data set comprising low-dimensional vector features; and, according to a preset index structure, storing the low-dimensional vector data set, the high-dimensional vector data set and the target data set in a first vector layer, a second vector layer and an original data layer of a database respectively, so that the target data set can be obtained when searching by the low-dimensional vector features or the high-dimensional vector features, wherein the high-dimensional vector features, the low-dimensional vector features and the original data features are associated with one another. The method enables efficient storage and fast retrieval of the target data.

Description

Data storage method, retrieval method, system, device and storage medium
Technical Field
The present application relates to the technical field of data processing, and in particular to a data storage method, a retrieval method, a system, a device and a storage medium.
Background
With the continuous development and popularization of internet information technology, traditional paper files are gradually being replaced by digital documents, traditional business is gradually moving to e-commerce platforms, and more and more users choose to publish digital content on social media platforms, online forums and the like. This generates a large amount of data of different types, such as text, images and audio, so storage management of the data is necessary.
At present, different types of data are stored in different databases. When a certain type of data needs to be retrieved, the corresponding database must first be located, and the data is then searched for according to key information in the original data. Using multiple databases to store different types of data increases the complexity of retrieval, and the data content has to be parsed item by item and matched against the key information of the original data. This way of storing and retrieving data therefore seriously reduces retrieval efficiency.
Disclosure of Invention
The main purpose of the embodiments of the present application is to provide a data storage method, a retrieval method, a system, a device and a storage medium, which can realize efficient storage and rapid retrieval of target data.
To achieve the above object, a first aspect of an embodiment of the present application proposes a data storage method, including: acquiring a target data set to be stored, and extracting the data type of the target data set, wherein the target data set comprises original data characteristics; performing high-dimensional conversion on the target data set according to the data type to obtain a high-dimensional vector data set, wherein the high-dimensional vector data set comprises high-dimensional vector features; performing low-dimensional conversion on the high-dimensional vector data set according to the data type to obtain a low-dimensional vector data set, wherein the low-dimensional vector data set comprises low-dimensional vector features; and respectively storing the low-dimensional vector data set, the high-dimensional vector data set and the target data set in a first vector layer, a second vector layer and an original data layer of a database according to a preset index structure, so that the target data set is obtained when searching according to the low-dimensional vector feature or the high-dimensional vector feature, wherein the high-dimensional vector feature, the low-dimensional vector feature and the original data feature are related to each other.
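As an illustration only, the three-layer storage described above can be sketched in Python as follows. The class name LayeredStore, the put method and the use of a shared record identifier to associate the three layers are assumptions made for this sketch; the application does not prescribe them.

```python
from dataclasses import dataclass, field


@dataclass
class LayeredStore:
    """Hypothetical single database holding three associated layers."""
    first_vector_layer: dict = field(default_factory=dict)   # low-dimensional vectors
    second_vector_layer: dict = field(default_factory=dict)  # high-dimensional vectors
    original_data_layer: dict = field(default_factory=dict)  # raw target data

    def put(self, record_id, raw, high_vec, low_vec):
        # The same record_id is written to every layer, so a hit in either
        # vector layer can be followed back to the original data.
        self.first_vector_layer[record_id] = tuple(low_vec)
        self.second_vector_layer[record_id] = tuple(high_vec)
        self.original_data_layer[record_id] = raw
```

In this sketch the association between the high-dimensional vector features, the low-dimensional vector features and the original data features is carried entirely by the shared key.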
In some embodiments, the performing high-dimensional transformation on the target data set according to the data type to obtain a high-dimensional vector data set includes: when the data type is the first type, performing word segmentation operation on the target data set to obtain a word segmentation result, and performing high-dimensional vector mapping operation on the word segmentation result to obtain a high-dimensional vector data set; and when the data type is the second type, inputting the target data set into a pre-trained conversion model, performing feature extraction operation on the target data set according to the conversion model to obtain high-dimensional vector features, and performing high-dimensional vectorization operation on the high-dimensional vector features to obtain a high-dimensional vector data set.
In some embodiments, the performing high-dimensional transformation on the target data set according to the data type to obtain a high-dimensional vector data set further includes: when the target data set to be stored comprises at least two mutually related sub-target data sets, respectively carrying out high-dimensional conversion on each sub-target data set according to the data type of each sub-target data set to obtain a plurality of sub-high-dimensional vector data sets containing sub-high-dimensional vector features; and vector fusion is carried out on the plurality of sub-high-dimensional vector data sets, so that a high-dimensional vector data set comprising a plurality of sub-high-dimensional vector features is obtained.
In some embodiments, the performing low-dimensional conversion on the high-dimensional vector data set according to the data type to obtain a low-dimensional vector data set includes: when the data type is linear, extracting a plurality of linear vector features from the high-dimensional vector data set, and performing low-dimensional vector mapping operation on the high-dimensional vector data set according to the linear vector features to obtain a low-dimensional vector data set; and when the data type is a nonlinear type, calculating a similarity value between nonlinear vector features in the high-dimensional vector data set, and performing low-dimensional vector mapping operation on the nonlinear vector features according to the similarity value to obtain a low-dimensional vector data set.
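For the linear case, one possible reading of the low-dimensional vector mapping is a projection onto the direction of greatest variance, as in principal component analysis. The sketch below is written under that assumption; the choice of principal component analysis, the power-iteration method and the function names are not taken from the application, and only a single output dimension is computed to keep the example short.

```python
import math


def top_principal_direction(vectors, iterations=100):
    """Estimate the direction of greatest variance by power iteration."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(dim)]
    centered = [[v[j] - mean[j] for j in range(dim)] for v in vectors]
    # Covariance matrix of the centered data.
    cov = [[sum(row[a] * row[b] for row in centered) / n for b in range(dim)]
           for a in range(dim)]
    # Start from a unit vector; a start orthogonal to the true direction
    # would not converge, which this sketch does not guard against.
    direction = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        product = [sum(cov[a][b] * direction[b] for b in range(dim))
                   for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break  # the data has no variance along the current direction
        direction = [x / norm for x in product]
    return mean, direction


def project(vector, mean, direction):
    """Map one high-dimensional vector to a single low-dimensional coordinate."""
    return sum((vector[j] - mean[j]) * direction[j] for j in range(len(vector)))
```

For the nonlinear case, a similarity-based mapping would be needed instead; that is outside the scope of this sketch.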
In some embodiments, the storing the low-dimensional vector data set, the high-dimensional vector data set, and the target data set in a first vector layer, a second vector layer, and an original data layer of a database according to a preset index structure includes: respectively extracting index information of the low-dimensional vector data set, the high-dimensional vector data set and the target data set, and respectively determining index structures corresponding to the low-dimensional vector data set, the high-dimensional vector data set and the target data set according to the index information; dividing the high-dimensional vector data set, the low-dimensional vector data set and the target data set into a plurality of feature data sets respectively, wherein each feature data set comprises a plurality of feature data; carrying out dimension variance calculation on the feature data of each feature data set to obtain a dimension maximum variance value, calculating an average value or a mode value of the vector data, and obtaining a dimension partition value according to the average value or the mode value; and acquiring storage partitions of each feature data set, and storing the low-dimensional vector data set, the high-dimensional vector data set and the target data set in the first vector layer, the second vector layer and the original data layer of the database according to the corresponding index structures, based on the dimension maximum variance value, the dimension partition value and the storage partitions.
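The variance and partition-value calculation resembles the way a k-d tree chooses a splitting plane. The following Python sketch illustrates one partitioning step under that reading; the function names and the choice of the average value (rather than the mode value) are assumptions for illustration.

```python
from statistics import mean, pvariance


def choose_split(vectors):
    """Pick the dimension with the largest variance and split at its mean."""
    dim = len(vectors[0])
    variances = [pvariance([v[j] for v in vectors]) for j in range(dim)]
    split_dim = max(range(dim), key=variances.__getitem__)
    split_value = mean(v[split_dim] for v in vectors)
    return split_dim, split_value


def partition(vectors):
    """Divide one feature data set into two storage partitions."""
    split_dim, split_value = choose_split(vectors)
    left = [v for v in vectors if v[split_dim] <= split_value]
    right = [v for v in vectors if v[split_dim] > split_value]
    return split_dim, split_value, left, right
```

Applying partition recursively to each half would yield a tree-shaped index; whether the preset index structure of the application is such a tree is not stated in this passage.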
In some embodiments, after the storing the low-dimensional vector data set, the high-dimensional vector data set, and the target data set in the first vector layer, the second vector layer, and the original data layer of the database according to the preset index structure, the method further includes: recording the search times of each high-dimensional vector data set, each low-dimensional vector data set and each target data set; and when the search times exceed a preset search threshold, adjusting the high-dimensional vector data set/the low-dimensional vector data set/the target data set to a priority node of a corresponding index structure, wherein the priority node is used for representing a priority search position of data.
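The search-count adjustment can be illustrated with a minimal Python sketch. Here the "priority node" is modeled simply as the front of an ordered list, which is an assumption; the class name PriorityIndex and its methods are likewise hypothetical.

```python
from collections import Counter


class PriorityIndex:
    """Toy index whose front position stands in for the priority node."""

    def __init__(self, keys, threshold):
        self.order = list(keys)      # search order; index 0 is searched first
        self.threshold = threshold   # preset search threshold
        self.hits = Counter()        # recorded search times per data set

    def record_search(self, key):
        self.hits[key] += 1
        # Promote only once the count exceeds the threshold.
        if self.hits[key] > self.threshold and self.order[0] != key:
            self.order.remove(key)
            self.order.insert(0, key)
```

With a threshold of 2, a data set is moved to the front on its third recorded search.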
To achieve the above object, a second aspect of the embodiments of the present application proposes a data retrieval method, including: acquiring target retrieval characteristics of a target data set to be retrieved; if the target retrieval feature is a low-dimensional retrieval feature, determining a low-dimensional vector data set and a low-dimensional vector feature corresponding to the low-dimensional vector data set in a first vector layer of a database according to the low-dimensional retrieval feature; determining a high-dimensional vector data set and high-dimensional vector features corresponding to the high-dimensional vector data set in a second vector layer of a database according to the low-dimensional vector features after high-dimensional conversion; and determining a target data set corresponding to the high-dimensional vector feature from an original data layer of a database according to the high-dimensional vector feature after the low-dimensional conversion, wherein the retrieved target data set comprises the original data feature, the original data feature is matched with the target retrieval feature, and the high-dimensional vector feature, the low-dimensional vector feature and the original data feature are associated with each other.
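The retrieval path from a low-dimensional retrieval feature to the original data can be sketched as below. The sketch assumes that the three layers are dictionaries sharing a record identifier and that matching is done by nearest Euclidean distance; both are illustrative assumptions, and the function name is hypothetical.

```python
import math


def retrieve_by_low_feature(low_query, first_layer, second_layer, original_layer):
    """Follow a low-dimensional match through to the original data."""
    # Step 1: find the closest low-dimensional vector in the first vector layer.
    record_id = min(first_layer,
                    key=lambda rid: math.dist(first_layer[rid], low_query))
    # Steps 2 and 3: follow the association to the second vector layer
    # and then to the original data layer.
    return {
        "record_id": record_id,
        "low": first_layer[record_id],
        "high": second_layer[record_id],
        "raw": original_layer[record_id],
    }
```

Because the first comparison is made on short vectors, the candidate set is narrowed cheaply before any high-dimensional or raw data is touched.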
To achieve the above object, a third aspect of the embodiments of the present application proposes a data storage system, the system comprising: an acquisition module, configured to acquire a target data set to be stored and extract the data type of the target data set, wherein the target data set comprises original data features; a high-dimensional conversion module, configured to perform high-dimensional conversion on the target data set according to the data type to obtain a high-dimensional vector data set, wherein the high-dimensional vector data set comprises high-dimensional vector features; a low-dimensional conversion module, configured to perform low-dimensional conversion on the high-dimensional vector data set according to the data type to obtain a low-dimensional vector data set, wherein the low-dimensional vector data set comprises low-dimensional vector features; and a storage module, configured to store the low-dimensional vector data set, the high-dimensional vector data set and the target data set in a first vector layer, a second vector layer and an original data layer of a database respectively according to a preset index structure, so that the target data set is obtained when searching according to the low-dimensional vector features or the high-dimensional vector features, wherein the high-dimensional vector features, the low-dimensional vector features and the original data features are associated with one another.
To achieve the above object, a fourth aspect of the embodiments of the present application proposes an electronic device, which includes a memory and a processor, where the memory stores a computer program, and the processor executes the computer program to implement the method according to the embodiment of the first aspect or the method according to the embodiment of the second aspect.
To achieve the above object, a fifth aspect of the embodiments of the present application proposes a storage medium, which is a computer-readable storage medium, storing a computer program, which when executed by a processor implements the method according to the embodiment of the first aspect or the method according to the embodiment of the second aspect.
The embodiments of the application provide a data storage method, a retrieval method, a system, a device and a storage medium, which first acquire a target data set to be stored and extract the data type of the target data set, wherein the target data set comprises original data features; then perform high-dimensional conversion on the target data set according to the data type to obtain a high-dimensional vector data set, wherein the high-dimensional vector data set comprises high-dimensional vector features; then perform low-dimensional conversion on the high-dimensional vector data set according to the data type to obtain a low-dimensional vector data set, wherein the low-dimensional vector data set comprises low-dimensional vector features; and store the low-dimensional vector data set, the high-dimensional vector data set and the target data set in a first vector layer, a second vector layer and an original data layer of a database respectively according to a preset index structure, so that the target data set is obtained when searching according to the low-dimensional vector features or the high-dimensional vector features, wherein the high-dimensional vector features, the low-dimensional vector features and the original data features are associated with one another.
The target data set, the high-dimensional vector data set and the low-dimensional vector data set corresponding to the target data set can be stored in the same database, and are not required to be stored in a plurality of databases respectively, so that efficient storage of multiple data modes of the target data is realized; in addition, by such a data storage method, when searching the target data set, the original data features of the target data set do not need to be searched one by one, but the high-dimensional vector features and the low-dimensional vector features which are related to the original data features can be searched, so that the searching range of the target data set can be narrowed by the low-dimensional vector data and the high-dimensional vector data, and the searching speed of the low-dimensional vector data is very rapid, so that the rapid searching of the target data can be realized.
Drawings
FIG. 1 is a schematic diagram of an application scenario of a data storage system provided in an embodiment of the present application;
FIG. 2 is an alternative flow chart of a data storage method provided by an embodiment of the present application;
FIG. 3 is a flow chart of one implementation of step S102 of FIG. 2;
FIG. 4 is a flowchart of another implementation of step S102 of FIG. 2;
FIG. 5 is a flow chart of one implementation of step S103 of FIG. 2;
FIG. 6 is a flow chart of one implementation of step S104 of FIG. 2;
FIG. 7 is a flow chart of one implementation after step S104 of FIG. 2;
FIG. 8 is an alternative flow chart of a data retrieval method provided by an embodiment of the present application;
FIG. 9 is a schematic diagram of the functional modules of a data storage system provided in an embodiment of the present application;
FIG. 10 is a schematic diagram of a hardware structure of an electronic device provided in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
It should be noted that although functional block division is performed in a device diagram and a logic sequence is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the block division in the device, or in the flowchart. The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
First, several terms used in this application are explained:
High-dimensional vector data refers to data represented in a multi-dimensional space, where each data point is composed of multiple features or dimensions, which may be of different types, such as numerical values, categories, text, and the like. In an embodiment of the present application, the high-dimensional vector data set includes one or more high-dimensional vector data.
Low-dimensional vector data refers to data represented in a lower dimensional space, where each data point consists of a small number of features or dimensions, with the dimensions of the low-dimensional vector data being smaller relative to the high-dimensional vector data. In an embodiment of the present application, the low-dimensional vector data set includes one or more low-dimensional vector data.
The data storage method, retrieval method, system, device and storage medium provided by the embodiments of the application are described in detail through the following embodiments. The data storage system in the embodiments of the application is described first.
FIG. 1 is a schematic diagram of an application scenario of a data storage system according to an embodiment of the present application. As shown in FIG. 1, the data storage system (hereinafter referred to as the "system") includes a user side 11 and a server side 12. The server side 12 includes a database that can store the high-dimensional vector data set, the low-dimensional vector data set and the target data set in different data layers. A user may perform a retrieval operation at the user side 11: when the user inputs a target retrieval feature at the user side 11, the server side 12 searches the database according to the target retrieval feature to obtain the target data set. It should be noted that the server side 12 may also be connected to a plurality of user sides 11 to support retrieval operations by users on multiple terminals. The specific configuration depends on the actual needs of the operator and is not limited in this embodiment of the present application.
The data storage method in the embodiment of the present application may be described by the following embodiment.
In some embodiments, the embodiments of the present application may also acquire and process related data based on artificial intelligence techniques. For example, the target data set may undergo high-dimensional conversion and low-dimensional conversion by means of artificial intelligence before being stored in the server-side database. Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
The embodiments of the application provide a data storage method and relate to the technical field of artificial intelligence. The data storage method provided by the embodiments of the application can be applied to a terminal, to a server side, or to software running in the terminal or the server side. In some embodiments, the terminal may be a smartphone, a tablet, a notebook computer, a desktop computer, etc.; the server side may be configured as an independent physical server, as a server cluster or distributed system formed by a plurality of physical servers, or as a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms; the software may be an application or the like that implements the data storage method, but is not limited to the above forms.
The subject application is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In the embodiments of the present application, when processing is required on data related to a user's identity or characteristics, such as user information, user behavior data, user history data, user location information, and the like, the permission or consent of the user is obtained first. Moreover, the collection, use and processing of such data comply with relevant laws and regulations. In addition, when an embodiment of the application needs to acquire sensitive personal information of the user, the separate permission or separate consent of the user is obtained through a pop-up window, a jump to a confirmation page, or the like, and only after that separate permission or consent has been explicitly obtained is the user-related data necessary for the normal operation of the embodiment acquired.
Based on this, the data storage method in the embodiments of the present application can be illustrated by the following embodiments.
Fig. 2 is an optional flowchart of a data storage method provided in an embodiment of the present application, where the method in fig. 2 may include, but is not limited to, steps S101 to S104.
Step S101, acquiring a target data set to be stored, and extracting the data type of the target data set, wherein the target data set comprises original data characteristics;
In some embodiments, the target data set may be obtained from user input or obtained from a database in which the target data set is pre-stored.
In some embodiments, the target data sets may be in various data forms such as text, audio, image or video, and each target data set may include one or more target data, where the data features of the target data are the original data features of the target data set.
Illustratively, a certain target data set is student information, which includes heights and weights of all students, and then each student corresponds to one target data, and the height information and the weight information are original data features of the target data set.
In some embodiments, the data type of the target data set may be extracted by an extraction operation. When the target data set is stored in a particular storage format, the data type may be determined from that format. For example, when a target data set is stored in a txt file, the data type of the target data set may be determined to be a text type.
In some embodiments, the data type of the target data set may also be determined from the transmission channel. If the data is transmitted via the HTTP or HTTPS protocol, the HTTP header typically includes a Content-Type field that indicates the data type of the request or response body. For example, when the Content-Type is "text/plain", the data type is determined to be a text type, and when the Content-Type is "image/jpeg", "image/png" or "image/gif", the data type is determined to be an image type.
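The Content-Type rule above can be sketched as follows. This is a minimal illustration only; the function name `data_type_from_content_type` and the "audio", "video" and "unknown" labels are assumptions that do not appear in the application.

```python
def data_type_from_content_type(content_type: str) -> str:
    """Map an HTTP Content-Type header value to a coarse data type label."""
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    top_level = media_type.split("/", 1)[0]
    # Hypothetical label table; the text above only names the text and image cases.
    known = {"text": "text", "image": "image", "audio": "audio", "video": "video"}
    return known.get(top_level, "unknown")


print(data_type_from_content_type("text/plain; charset=utf-8"))
print(data_type_from_content_type("image/png"))
```

Keying on the top-level media type keeps the table short, since "image/jpeg", "image/png" and "image/gif" all resolve to the same image type.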
Step S102, performing high-dimensional conversion on a target data set according to the data type to obtain a high-dimensional vector data set, wherein the high-dimensional vector data set comprises high-dimensional vector features;
in some embodiments, for different data types of a target data set, different high-dimensional transformation operations may be performed on the target data set, and a high-dimensional vector data set corresponding to the target data set may be generated, where the high-dimensional vector data set includes one or more high-dimensional vector data, each of which includes high-dimensional vector features.
In some embodiments, after the target data set has been converted into a high-dimensional vector data set, various machine learning algorithms can more easily be used to classify, cluster or regress the high-dimensional vector data, so that differences and similarities between target data sets are reflected more accurately. Because the high-dimensional vector data set is associated with the target data set and the two are stored in association, the corresponding high-dimensional vector data set can be found by searching the high-dimensional vector features during retrieval, which narrows the search range for the target data set and enables fast retrieval of the target data set.
Step S103, performing low-dimensional conversion on the high-dimensional vector data set according to the data type to obtain a low-dimensional vector data set, wherein the low-dimensional vector data set comprises low-dimensional vector features;
in some embodiments, for different data types of the target data set, different low-dimensional transformation operations may be performed on a high-dimensional vector data set, and a low-dimensional vector data set corresponding to the high-dimensional vector data set is generated, where the low-dimensional vector data set includes one or more low-dimensional vector data, and each low-dimensional vector data includes a low-dimensional vector feature.
In some embodiments, after the high-dimensional vector data set has been converted into a low-dimensional vector data set, various machine learning algorithms can more easily be used to classify, cluster or regress the low-dimensional vector data, so that differences and similarities between target data sets are reflected more accurately. Because the low-dimensional vector data set is associated with the target data set and the two are stored in association, the corresponding low-dimensional vector data set can be found by searching the low-dimensional vector features during retrieval, which narrows the search range for the target data set and enables fast retrieval of the target data set.
Step S104, respectively storing the low-dimensional vector data set, the high-dimensional vector data set and the target data set in a first vector layer, a second vector layer and an original data layer of a database according to a preset index structure, so that the target data set is obtained when searching according to the low-dimensional vector features or the high-dimensional vector features, wherein the high-dimensional vector features, the low-dimensional vector features and the original data features are related to each other.
In some embodiments, the low-dimensional vector data set, the high-dimensional vector data set, and the target data set are stored at different data layers of the database, respectively, and illustratively, the low-dimensional vector data set, the high-dimensional vector data set, and the target data set may be stored at the first vector layer, the second vector layer, and the original data layer, respectively.
In some embodiments, a variety of index structures may be preset, and the relevant person may choose one or more index structures to store the data set according to particular needs.
In some embodiments, common index structures include B-Tree index, hash index, full-Text index, R-Tree index, and the like.
In some embodiments, the low-dimensional vector data set, the high-dimensional vector data set, and the target data set (hereinafter referred to simply as the "data sets") can be stored according to different index structures. The index structures corresponding to the individual data sets can differ, and each storage layer can include several different index structures so that a data set is stored in several different ways; when data is retrieved, the data sets can then be searched using whichever index structure suits the requirement. Illustratively, the original data layer employs a B-Tree index structure, while the first and second vector layers each employ two index structures, a KD tree (K-Dimensional Tree) and LSH (Locality Sensitive Hashing), for storage of the data sets.
In some embodiments, the low-dimensional vector data set, the high-dimensional vector data set and the target data set may be stored in different data layers of the database, respectively, so as to achieve efficient storage of data.
In some embodiments, more data storage layers may also be provided in the database to store other transformed forms of the target data set as actually needed to achieve more efficient data management. A metadata layer may be provided, for storing metadata related to the target data set, such as the source of the data, the time of creation, the extraction or dimension reduction method used, etc., which can be used for data management and auditing.
In some embodiments, because the low-dimensional vector dataset is obtained by low-dimensional conversion of the high-dimensional vector dataset and the high-dimensional vector dataset is obtained by high-dimensional conversion of the target dataset, there is an association between the low-dimensional vector dataset, the high-dimensional vector dataset, and the target dataset.
In some embodiments, since each data layer has its own index structure, interaction and association between the index structures of different levels are typically achieved through a mapping relationship, thereby associating the different data layers. In particular, an index entry in an upper-level index structure may contain a pointer to an index entry in the lower-level index structure, so that the lower-level index entry can be found quickly from the upper-level index entry. For example, an index entry of the original data layer may contain a pointer to the corresponding index entry of the second vector layer.
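The pointer-based association between layers can be sketched as below. This is only an illustration under stated assumptions: the `IndexEntry` and `LayeredStore` names, the key prefixes, and the use of plain dictionaries in place of real index structures are all hypothetical, and the pointer direction (from the vector layers towards the original data layer) is chosen here to suit retrieval rather than taken from the application.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class IndexEntry:
    payload: Any                   # vector or raw record held by this entry
    pointer: Optional[str] = None  # key of the associated entry in the next layer


class LayeredStore:
    """Toy three-layer store; plain dicts stand in for the real index structures."""

    def __init__(self) -> None:
        self.first_vector_layer: Dict[str, IndexEntry] = {}   # low-dimensional vectors
        self.second_vector_layer: Dict[str, IndexEntry] = {}  # high-dimensional vectors
        self.original_data_layer: Dict[str, IndexEntry] = {}  # raw target data

    def put(self, key: str, raw: Any, high_vec: List[float], low_vec: List[float]) -> None:
        # Store one record in all three layers and link the entries together.
        raw_key, high_key, low_key = f"raw:{key}", f"high:{key}", f"low:{key}"
        self.original_data_layer[raw_key] = IndexEntry(raw)
        self.second_vector_layer[high_key] = IndexEntry(high_vec, pointer=raw_key)
        self.first_vector_layer[low_key] = IndexEntry(low_vec, pointer=high_key)

    def resolve(self, low_key: str) -> Any:
        # Follow pointers: first vector layer -> second vector layer -> original data.
        high_key = self.first_vector_layer[low_key].pointer
        raw_key = self.second_vector_layer[high_key].pointer
        return self.original_data_layer[raw_key].payload


store = LayeredStore()
store.put("s1", {"height": 170, "weight": 60}, [0.1] * 8, [0.1, 0.2])
print(store.resolve("low:s1"))
```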
It can be understood that by performing high-dimensional conversion and low-dimensional conversion on the target data set, multiple modal data of the target data set are respectively stored in different data layers of the database, so that the inefficient storage caused by storing different modal data in different databases is avoided, and meanwhile, the storage mode can improve the retrieval efficiency of the target data set.
As shown in fig. 3, fig. 3 is a flowchart of one implementation of step S102 of fig. 2, and in some embodiments, step S102 may further include steps S201 to S202:
step S201, when the data type is the first type, word segmentation operation is carried out on the target data set, a word segmentation result is obtained, and high-dimensional vector mapping operation is carried out on the word segmentation result, so that a high-dimensional vector data set is obtained;
in some embodiments, the first type of data includes a text type, and when the target data set is text type data, the word segmentation operation may be performed on the target data set first to obtain a word segmentation result, and then the high-dimensional vector mapping operation may be performed on the word segmentation result to obtain a high-dimensional vector data set.
In some embodiments, for text data, text may be converted to high-dimensional vectors using Word embedding (Word embedding) or sentence embedding (Sentence Embeddings) techniques.
In some embodiments, either the word vector algorithm Word2Vec or the word vector algorithm GloVe may be employed; both map words into a vector space such that semantically similar words are also close to each other in that vector space.
In some embodiments, a Sentence vectorization algorithm BERT or Sentence-BERT may be used, each of which maps an entire Sentence into a high-dimensional vector space to capture semantic information of the Sentence.
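The two stages of step S201 (word segmentation, then high-dimensional vector mapping) can be sketched as follows. Note that this sketch deliberately swaps in a hashed bag-of-words mapping as a stand-in; it is not Word2Vec, GloVe, BERT or Sentence-BERT and captures no semantics. The names `segment`, `hashed_vector` and `DIM` are hypothetical, and the regular-expression segmentation only handles space-delimited alphanumeric text.

```python
import hashlib
import re
from typing import List

DIM = 64  # hypothetical vector dimension


def segment(text: str) -> List[str]:
    """Very rough word segmentation: lowercase runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def hashed_vector(tokens: List[str], dim: int = DIM) -> List[float]:
    """Map tokens to a fixed-length count vector via a stable hash."""
    vec = [0.0] * dim
    for token in tokens:
        # hashlib gives the same bucket on every run, unlike the built-in hash().
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim
        vec[index] += 1.0
    return vec


tokens = segment("Sunny weather, sunny city")
vector = hashed_vector(tokens)
print(tokens, len(vector))
```

In a real system the `hashed_vector` step would be replaced by a trained embedding model; the shape of the pipeline (tokens in, fixed-length vector out) stays the same.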
In some embodiments, the target data set may be preprocessed before the high-dimensional conversion is performed. The preprocessing may include data cleansing, format conversion, feature extraction, and the like.
Step S202, when the data type is the second type, inputting the target data set into a pre-trained conversion model, performing feature extraction operation on the target data set according to the conversion model to obtain high-dimensional vector features, and performing high-dimensional vectorization operation on the high-dimensional vector features to obtain a high-dimensional vector data set.
In some embodiments, the second type of data includes image, audio, and video data that requires processing using a data model.
In some embodiments, for image data, a convolutional neural network (Convolutional Neural Network, CNN) may be used to convert the image into a high-dimensional vector. During training, the CNN learns feature representations that capture various visual properties of images. In general, feature vectors can be extracted from the fully connected layers (or other suitable layers) of the CNN and used as the high-dimensional vector representation of the images.
In some embodiments, for audio data, the audio signal may be converted to a high-dimensional vector using a deep learning method, such as a recurrent neural network (Recurrent Neural Networks, RNN) or a one-dimensional convolutional neural network (1D CNN). Illustratively, the audio signal is first pre-processed, such as extracting Mel-frequency cepstral coefficients (Mel-Frequency Cepstral Coefficients, MFCC) or other features, and then vector representations of the audio features are learned using RNN or 1D CNN.
In some embodiments, for video data, a three-dimensional convolutional neural network (3D CNN) or a Two-Stream convolutional neural network (Two-Stream CNN) may be used to convert a sequence of video frames into high-dimensional vectors. Wherein the 3D CNN is able to capture spatial and temporal information of the video, while the two-stream CNN learns spatial information (using RGB frames) and temporal information (using optical flow) of the video, respectively, and then feature vectors can be extracted from the appropriate layers of these networks for use as a high-dimensional vector representation of the video.
In some embodiments, the high-dimensional vector data set is typically represented by floating point numbers, which may require a large memory space. To reduce storage requirements, the original floating point number vector may be encoded into a more compact binary or integer vector using vector encoding techniques such as product quantization (Product Quantization, PQ) or SimHash, or the like. This encoding method can greatly reduce the storage space requirements while maintaining an approximate calculation of the similarity between vectors.
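The idea of encoding a floating-point vector into a compact binary code can be sketched as below. The sketch uses sign random projections (one bit per random hyperplane), a scheme in the same family as SimHash; it is not product quantization, and the function names, the seed and the bit width are assumptions made for illustration.

```python
import random
from typing import List


def make_hyperplanes(dim: int, bits: int, seed: int = 0) -> List[List[float]]:
    """Draw one random Gaussian hyperplane per output bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def binary_code(vector: List[float], hyperplanes: List[List[float]]) -> int:
    """Encode a float vector as an integer whose bits are projection signs."""
    code = 0
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vector, plane))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two codes."""
    return bin(a ^ b).count("1")


planes = make_hyperplanes(dim=8, bits=16, seed=42)
v = [0.5, -1.0, 2.0, 0.0, 3.5, -0.25, 1.0, 4.0]
print(hamming(binary_code(v, planes), binary_code([2 * x for x in v], planes)))
```

Vectors separated by a small angle tend to agree on most bits, so the Hamming distance between codes serves as a cheap approximation of angular similarity.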
In some embodiments, the high-dimensional vector data set may also be compressed using various compression algorithms such as LZ77 (Lempel-Ziv 77), LZ78 (Lempel-Ziv 78), or Brotli, etc., to further reduce storage space requirements, and the compressed data may be stored in a database with less storage space.
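Lossless compression of a stored vector can be sketched as follows. Python's standard `zlib` module is used here as a stand-in: it implements DEFLATE, which combines LZ77-style matching with Huffman coding, and is not the plain LZ77, LZ78 or Brotli algorithm named above. The 32-bit float layout and the function names are assumptions.

```python
import struct
import zlib
from typing import List


def compress_vector(vector: List[float]) -> bytes:
    """Pack a vector as little-endian 32-bit floats, then compress it."""
    raw = struct.pack(f"<{len(vector)}f", *vector)
    return zlib.compress(raw, level=9)


def decompress_vector(blob: bytes) -> List[float]:
    """Inverse of compress_vector."""
    raw = zlib.decompress(blob)
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


sparse = [0.0] * 1000 + [1.5, -2.25]
blob = compress_vector(sparse)
print(len(sparse) * 4, len(blob))
```

The gain depends heavily on the data: sparse or repetitive vectors shrink a great deal, while dense embeddings with near-random mantissa bits usually compress poorly.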
As shown in fig. 4, fig. 4 is another implementation flowchart of step S102 of fig. 2, and in some embodiments, step S102 may further include steps S301 to S302:
step S301, when a target data set to be stored comprises at least two mutually related sub-target data sets, respectively carrying out high-dimensional conversion on each sub-target data set according to the data type of each sub-target data set to obtain a plurality of sub-high-dimensional vector data sets containing sub-high-dimensional vector features;
in some embodiments, the target data set may comprise sub-target data sets of more than one data type, and the sub-target data sets may be associated with one another. To achieve faster data retrieval, each of the associated sub-target data sets may be subjected to high-dimensional conversion, resulting in sub-high-dimensional vector data sets comprising sub-high-dimensional vector features.
In some embodiments, the interrelated sub-target data sets may be multiple representations of one data, such as text type data and image type data of city weather; alternatively, there may be two different data that are associated, such as text type data for city weather and city introduction video type data.
In some embodiments, the target data set may comprise a plurality of sub-target data sets, of which some are associated with each other and the others are not. In this case, the high-dimensional conversion may be performed on the sub-target data sets that are associated with each other.
Step S302, vector fusion is carried out on the plurality of sub-high-dimensional vector data sets, and a high-dimensional vector data set comprising a plurality of sub-high-dimensional vector features is obtained.
In some embodiments, vector fusion may be performed on the plurality of sub-high-dimensional vector data sets obtained by the high-dimensional conversion. The fused high-dimensional vector data set includes the sub-high-dimensional vector features of every sub-target data set that took part in the fusion. Fusing high-dimensional vector data sets of different modalities makes use of information from different sensing channels and gives a better representation of the features of a multi-modal object, so that the target data set can be retrieved more accurately and quickly.
In some embodiments, the vector fusion method may include vector splicing, weighted averaging, feature crossing, linear transformation, or model fusion.
Illustratively, the target data set comprises a sub-target data set A of the text data type and a sub-target data set B of the image data type that is associated with it. First, the sub-target data set A is converted into a sub-high-dimensional vector data set A using word embedding (e.g., Word2Vec or GloVe); assuming that the word embedding function is E, the vector representation of the sub-target data set A is E(A). Likewise, a pre-trained convolutional neural network (e.g., VGG or ResNet) may be used to convert the sub-target data set B into a sub-high-dimensional vector data set B; assuming that the convolutional neural network is C, the vector representation of the sub-target data set B is C(B). The sub-high-dimensional vector data set A and the sub-high-dimensional vector data set B may then be fused into a new vector. For example, when vector splicing is used for the fusion, the fused high-dimensional vector data set is [E(A), C(B)].
For example, after the fused high-dimensional vector data set has been obtained, a further feature enhancement operation may be performed on it. For instance, the fused high-dimensional vector data set [E(A), C(B)] may be input into a fully connected neural network F (Fully Connected Network, FCN), and the feature-enhanced result is then F([E(A), C(B)]).
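Two of the fusion methods listed above, vector splicing and weighted averaging, can be sketched as follows. The function names and the sample values standing in for E(A) and C(B) are hypothetical; the sketch is not the application's implementation.

```python
from typing import List, Sequence


def concat_fusion(*vectors: Sequence[float]) -> List[float]:
    """Vector splicing: [E(A), C(B)] is the two vectors laid end to end."""
    fused: List[float] = []
    for vec in vectors:
        fused.extend(vec)
    return fused


def weighted_average_fusion(vectors: Sequence[Sequence[float]],
                            weights: Sequence[float]) -> List[float]:
    """Weighted average of equal-length vectors."""
    if len({len(vec) for vec in vectors}) != 1:
        raise ValueError("weighted averaging needs vectors of equal length")
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * vec[i] for vec, w in zip(vectors, weights)) / total
            for i in range(dim)]


e_a = [0.0, 2.0]  # stands in for E(A), the text embedding
c_b = [4.0, 6.0]  # stands in for C(B), the image embedding
print(concat_fusion(e_a, c_b))
print(weighted_average_fusion([e_a, c_b], [1.0, 3.0]))
```

Splicing preserves every component but grows the dimension, whereas weighted averaging keeps the dimension fixed and therefore requires the sub-vectors to share one length.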
As shown in fig. 5, fig. 5 is a flowchart of one implementation of step S103 of fig. 2, in some embodiments, step S103 may include steps S401 to S402:
step S401, when the data type is linear type, extracting a plurality of linear vector features from the high-dimensional vector data set, and performing low-dimensional vector mapping operation on the high-dimensional vector data set according to the linear vector features to obtain a low-dimensional vector data set;
in some embodiments, after the high-dimensional vector data set corresponding to the target data set is stored in the database, the low-dimensional conversion needs to be performed on the high-dimensional vector data set to obtain a low-dimensional vector data set, wherein the low-dimensional vector data set contains low-dimensional vector features, and the retrieval speed of the low-dimensional vector features is very rapid.
In some embodiments, the low-dimensional vector mapping operation is a dimension-reduction process, which is a process of mapping high-dimensional data into a low-dimensional space, aimed at preserving key features and structures of the data. It will be appreciated that deriving a low-dimensional vector dataset through a dimension reduction technique may help reduce computational complexity, reduce memory requirements, and improve visualization results when processing high-dimensional vectors.
In some embodiments, when the data type of the target data set further includes a linear type and a non-linear type, a plurality of linear vector features may be extracted from the high-dimensional vector data set when the target data set is of the linear type, and then, a low-dimensional vector mapping operation is performed on the high-dimensional vector data set according to the linear vector features, resulting in a low-dimensional vector data set.
In some embodiments, principal component analysis (Principal Component Analysis, PCA) is a linear dimension reduction technique. It achieves dimension reduction by finding the directions of greatest variance in the data and projecting the data into a low-dimensional space spanned by these principal components. PCA is typically used to process continuous data, i.e., linear-type data, and assumes that the data is linear in the low-dimensional space.
In some embodiments, linear discriminant analysis (Linear Discriminant Analysis, LDA) is a supervised dimension reduction method that attempts to find the directions that maximize the distance between categories when the data is projected into a low-dimensional space. LDA is typically used for classification problems, with the goal of maximizing the separability of categories in the reduced space.
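A minimal sketch of PCA reducing data to a single dimension is given below. It finds the first principal component by power iteration on the sample covariance matrix, which is one of several ways to compute it; the function name `pca_1d`, the fixed iteration count and the all-ones starting vector are assumptions, and the starting vector would fail in the rare case that it is orthogonal to the principal direction.

```python
import math
from typing import List, Tuple


def pca_1d(data: List[List[float]], iterations: int = 100) -> Tuple[List[float], List[float]]:
    """Return the first principal component and the 1-D projection of the data."""
    n, dim = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(dim)]
    centered = [[row[j] - means[j] for j in range(dim)] for row in data]
    # Sample covariance matrix (dim x dim).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(dim)]
           for a in range(dim)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    vec = [1.0] * dim
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    projected = [sum(r[j] * vec[j] for j in range(dim)) for r in centered]
    return vec, projected


component, low_dim = pca_1d([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
print(component, low_dim)
```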
And step S402, when the data type is nonlinear, calculating a similarity value between nonlinear vector features in the high-dimensional vector data set, and performing low-dimensional vector mapping operation on the nonlinear vector features according to the similarity value to obtain a low-dimensional vector data set.
In some embodiments, when the data type of the target data set is a nonlinear type, a nonlinear dimension reduction method is correspondingly adopted.
In some embodiments, t-distributed stochastic neighbor embedding (t-Distributed Stochastic Neighbor Embedding, t-SNE) is a nonlinear dimension reduction method that maps high-dimensional data to a low-dimensional space while preserving as much as possible of the similarity structure of the original data. t-SNE achieves dimension reduction by minimizing the KL (Kullback-Leibler) divergence between the similarity distributions in the high-dimensional space and the low-dimensional space, and it is particularly useful for visualizing high-dimensional data such as images, text, and bioinformatics data.
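For reference, the objective that t-SNE minimizes can be written in its standard form as follows. The symbols are not taken from the application and are introduced here only for illustration: $p_{ij}$ denotes the pairwise similarity of points $i$ and $j$ in the high-dimensional space, $q_{ij}$ the corresponding similarity in the low-dimensional space, and $y_i$ the low-dimensional position of point $i$.

```latex
C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}},
\qquad
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^{2}\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^{2}\right)^{-1}}
```

The heavy-tailed Student-t kernel in $q_{ij}$ is what lets moderately distant points sit far apart in the low-dimensional map.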
As shown in fig. 6, fig. 6 is a flowchart of one implementation of step S104 of fig. 2, and in some embodiments, step S104 may include steps S501 to S504:
step S501, index information of a low-dimensional vector data set, a high-dimensional vector data set and a target data set is respectively extracted, and index structures corresponding to the high-dimensional vector data set, the low-dimensional vector data set and the target data set are respectively determined according to the index information;
in some embodiments, the index information may be information such as key attributes or key fields, and the index structures of the low-dimensional vector data set, the high-dimensional vector data set and the target data set are determined one by one according to the index information. Illustratively, if the index information of the target data set is a data sequence number and each data sequence number is unique, a B+ tree may be used as the index structure for storage.
In some embodiments, to achieve efficient vector retrieval, high-dimensional vectors need to be stored using a suitable data structure (i.e., index structure). Common data structures include the KD tree, the ball tree (Ball-Tree), the VP tree (Vantage-Point Tree), and the like. These tree structures organize vectors according to the distance relationships between them, which reduces unnecessary distance computations during retrieval.
In some embodiments, to further improve retrieval efficiency, the database may be partitioned using an indexing technique. For example, an inverted index may be used to group similar vectors together so that the search space during retrieval is reduced. The vector database may also be partitioned into smaller subsets using techniques such as locality sensitive hashing (Locality Sensitive Hashing, LSH) or k-means clustering, to enable distributed storage and parallel retrieval.
Step S502, dividing the high-dimensional vector data set, the low-dimensional vector data set and the target data set into a plurality of characteristic data sets respectively, wherein the characteristic data sets comprise a plurality of vector data;
in some embodiments, after determining the corresponding index structure, a further determination of partitioning policies is required to better store the data set.
In some embodiments, the high-dimensional vector data set, the low-dimensional vector data set and the target data set are first each divided into a plurality of feature data sets; the target dimension maximum variance value and the dimension division value are then determined from these groups.
Step S503, performing dimension variance calculation on the feature data of each feature data set to obtain a dimension maximum variance value, calculating an average value or a mode value of the feature data, and calculating a dimension division value according to the average value or the mode value;
In some embodiments, the related art typically uses the median of the target data as the division value; such simple strategies may not work well for high-dimensional or unevenly distributed data. In the embodiments of the present application, the dimension in which the target data has the maximum variance is therefore used as the division dimension, so as to increase the variability between subspaces as much as possible.
Illustratively, if the feature data set is an n-dimensional data set D and the goal is to select a division dimension d, the variance of the data in each dimension may be calculated, and the dimension with the largest variance is then selected as the division dimension. The division dimension d is calculated as shown in formula (1):
$$d = \underset{1 \le i \le n}{\arg\max}\ \operatorname{var}(D_i) \qquad (1)$$
where D_i represents the data of the data set D in the ith dimension, var (D_i) represents the variance of D_i, argmax represents finding the dimension that maximizes the variance.
In some embodiments, the dimension division value is the value at which the data is split along the division dimension once the dimension with the maximum variance has been determined. It is used to assign the data of the high-dimensional vector data set, the low-dimensional vector data set and the target data set to the corresponding child nodes or regions. Because the dimension division value determines where and how the data is divided, it plays an important role in both storing and querying the data sets.
Step S504, a storage partition of each feature data set is obtained, and the low-dimensional vector data set, the high-dimensional vector data set and the target data set are respectively stored in the first vector layer, the second vector layer and the original data layer of the database according to the corresponding index structures, based on the maximum variance value, the dimension division value and the storage partition.
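The selection of the division dimension by formula (1) and of the division value from the average can be sketched as below. The function names are hypothetical, population variance is assumed for var(D_i), and dimensions are indexed from 0 in the code whereas formula (1) counts from 1.

```python
from statistics import mean, pvariance
from typing import List, Tuple


def choose_split(dataset: List[List[float]]) -> Tuple[int, float, float]:
    """Return (division dimension, its variance, division value)."""
    columns = list(zip(*dataset))  # columns[i] holds D_i
    variances = [pvariance(col) for col in columns]
    d = max(range(len(columns)), key=lambda i: variances[i])  # argmax of var(D_i)
    split_value = mean(columns[d])  # the average is used as the division value
    return d, variances[d], split_value


def partition(dataset: List[List[float]], d: int, split_value: float):
    """Assign each record to the left or right child node."""
    left = [row for row in dataset if row[d] <= split_value]
    right = [row for row in dataset if row[d] > split_value]
    return left, right


data = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
d, max_var, split_value = choose_split(data)
print(d, split_value, partition(data, d, split_value))
```

Applying `choose_split` and `partition` recursively to each child yields a KD-tree-like structure whose splits follow the spread of the data.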
In some embodiments, the memory partitions are used to characterize different memory regions, and the first vector layer, the second vector layer, and the original data layer each correspond to one or more memory partitions.
In some embodiments, after the maximum variance value, the dimension division value, and the storage partition have been determined, the low-dimensional vector data set, the high-dimensional vector data set, and the target data set may be stored in the first vector layer, the second vector layer, and the original data layer of the database, respectively, according to the index structure of each layer.
In some embodiments, different data layers of a database may be further divided into multiple storage partitions, where different storage partitions may correspond to different index structures, and the same data set may utilize different index structures to store the data set in different storage partitions, or different storage partitions may use the same index structures to store different data.
As shown in fig. 7, fig. 7 is a flowchart of an implementation after step S104 in fig. 2, and in some embodiments, after step S104 the method may further include steps S601 to S602:
step S601, recording the searching times of each high-dimensional vector data set, each low-dimensional vector data set and each target data set;
in some embodiments, the number of searches refers to the frequency of data access or the query load.
In some embodiments, in each data storage layer of the database, the search times of each high-dimensional vector data set, each low-dimensional vector data set and each target data set can be recorded and stored, and the storage of the data sets is adjusted according to the search times, so that the storage of the data sets is more efficient and reasonable, and the search efficiency of the data sets is further improved.
Step S602, when the number of searches exceeds a preset search threshold, the high-dimensional vector data set, the low-dimensional vector data set or the target data set is moved to the priority node of the corresponding index structure, wherein the priority node represents the position that is searched first.
In some embodiments, a search threshold may be preset, the search threshold being used to compare with a search number, and the storage structure may be optimized when the search number is greater than the search threshold.
In some embodiments, when the number of searches exceeds the preset search threshold, the high-dimensional vector data set, the low-dimensional vector data set or the target data set may be moved to a priority node of the corresponding index structure. Illustratively, the index structure of the first vector layer includes three node layers, and the first node layer is usually searched first during retrieval. When the number of searches of a certain low-dimensional vector data set exceeds the preset search threshold, that low-dimensional vector data set may be moved to the first node layer of the index structure of the first vector layer, which completes the dynamic adjustment of the data set's storage and further improves the retrieval efficiency of the data set.
In some embodiments, resources may also be dynamically adjusted according to load conditions. Specifically, this may include dynamic adjustment of storage resources, of computing resources, and of network resources.
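The count-and-promote behaviour of steps S601 and S602 can be sketched as follows. The `PromotingIndex` class, its two-level layout (one priority node and one regular node) and the threshold value are assumptions for illustration; a real index structure would have more node layers.

```python
from collections import Counter
from typing import Any, Dict, Optional


class PromotingIndex:
    """Toy index that moves frequently searched entries to a priority node."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.priority_node: Dict[str, Any] = {}  # searched first
        self.regular_node: Dict[str, Any] = {}
        self.search_counts: Counter = Counter()

    def put(self, key: str, value: Any) -> None:
        self.regular_node[key] = value

    def get(self, key: str) -> Optional[Any]:
        self.search_counts[key] += 1  # step S601: record the number of searches
        if key in self.priority_node:
            return self.priority_node[key]
        value = self.regular_node.get(key)
        # Step S602: promote once the count exceeds the preset threshold.
        if key in self.regular_node and self.search_counts[key] > self.threshold:
            self.priority_node[key] = self.regular_node.pop(key)
        return value


index = PromotingIndex(threshold=2)
index.put("vec-1", [0.1, 0.2])
for _ in range(3):
    index.get("vec-1")
print("vec-1" in index.priority_node)
```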
In some embodiments, dynamic adjustment of storage resources includes dynamic migration of data and dynamic adjustment of copies. For example, when the storage space of a certain storage node of the index structure is insufficient, a part of data of the storage node can be migrated to other storage nodes; alternatively, as the frequency of access to a data set increases, the number of copies of the data set may be increased to increase the speed of access to the data.
In some embodiments, the dynamic adjustment of the computing resources includes dynamic scheduling and dynamic expansion of the storage nodes. For example, when the computational load of a certain storage node is too high, a portion of the storage data set of that storage node may be scheduled to other nodes; alternatively, as the overall storage requirements of the database increase, new storage nodes may be added to increase the overall storage capacity of the data.
In some embodiments, the dynamic adjustment of network resources includes dynamic adjustment of the network topology. For example, because the same data set may be stored using several different index structures, another index structure may be selected for retrieval when the network bandwidth is insufficient or the network latency is too high, so as to optimize the retrieval speed.
Fig. 8 is an optional flowchart of a data retrieval method provided in an embodiment of the present application, where the method in fig. 8 may include, but is not limited to, steps S701 to S704.
Step S701, obtaining target retrieval characteristics of a target data set to be retrieved;
in some embodiments, the data sets can be searched in a database that has been populated by the data storage method provided in the embodiments of the present application and therefore stores the low-dimensional vector data set, the high-dimensional vector data set and the target data set.
In some embodiments, the target retrieval feature may be key feature data of the high-dimensional vector dataset/the low-dimensional vector dataset/the target dataset.
Step S702, if the target retrieval feature is a low-dimensional retrieval feature, determining a low-dimensional vector data set and a low-dimensional vector feature corresponding to the low-dimensional vector data set in a first vector layer of a database according to the low-dimensional retrieval feature;
In some embodiments, when the target retrieval feature is a low-dimensional retrieval feature, a low-dimensional vector data set corresponding to the low-dimensional retrieval feature may be retrieved in the first vector layer of the database, where the low-dimensional vector data set includes the low-dimensional vector feature.
Step S703, determining a high-dimensional vector data set and a high-dimensional vector feature corresponding to the high-dimensional vector data set in a second vector layer of the database according to the low-dimensional vector feature after the high-dimensional conversion;
In some embodiments, since the low-dimensional vector data set is converted from the high-dimensional vector data set, after the low-dimensional vector data set is obtained, the low-dimensional vector features corresponding to it may be further subjected to high-dimensional conversion, and the resulting high-dimensional vector features are searched in the second vector layer. In this way, the high-dimensional vector data set corresponding to the retrieved low-dimensional vector data set is determined, where the high-dimensional vector data set includes the high-dimensional vector features.
Step S704, determining a target data set corresponding to the high-dimensional vector feature from the original data layer of the database according to the high-dimensional vector feature after the low-dimensional conversion, where the retrieved target data set includes the original data feature, the original data feature matches the target retrieval feature, and the high-dimensional vector feature, the low-dimensional vector feature and the original data feature are associated with one another.
In some embodiments, since the high-dimensional vector data set is converted from the target data set, after the high-dimensional vector data set is obtained, the high-dimensional vector features corresponding to the high-dimensional vector data set may be further subjected to low-dimensional conversion, and the original data features of the target data set obtained by the low-dimensional conversion are searched in the original data layer, so that the target data set corresponding to the searched high-dimensional vector data set is finally determined.
In some embodiments, retrieval first performs a fast similarity search over the low-dimensional vector data set in the first vector layer, then performs accurate similarity matching against the corresponding high-dimensional vector data set in the second vector layer, and finally returns the target data set retrieval result from the original data layer. It can be understood that this hierarchical retrieval strategy makes full use of the multi-level storage structure of the database and achieves efficient retrieval of the target data set.
Because the low-dimensional vector features have a small data volume and low search complexity, the low-dimensional vector data set corresponding to the target data set can be determined quickly from the low-dimensional retrieval feature, which narrows the search range in advance. The high-dimensional vector data set is then determined using the low-dimensional vector features after high-dimensional conversion, so that the search within the narrowed range is accurate and the target data set is finally determined. This avoids the low retrieval efficiency caused by searching the target data sets one by one.
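The coarse-to-fine order of steps S702 to S704 can be illustrated with the following sketch. It is not the claimed implementation: the toy data, the random projection used as the low-dimensional conversion, the brute-force distance search in place of an index structure, and the name `hierarchical_search` are all assumptions. For simplicity the sketch also assumes the query is available in high-dimensional form and derives its low-dimensional form with the same projection used at storage time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Original data layer: row id -> original record (hypothetical toy data).
original_layer = {i: f"record-{i}" for i in range(200)}

# Second vector layer: one 64-dimensional vector per record.
high_dim_layer = rng.normal(size=(200, 64))

# First vector layer: 8-dimensional vectors from an assumed random projection.
projection = rng.normal(size=(64, 8)) / np.sqrt(8)
low_dim_layer = high_dim_layer @ projection


def hierarchical_search(query_high, coarse_k=20, final_k=3):
    """Coarse search in the low-dimensional layer, then exact re-ranking."""
    query_low = query_high @ projection
    # Stage 1: cheap search in the first vector layer narrows the range.
    coarse_dist = np.linalg.norm(low_dim_layer - query_low, axis=1)
    candidates = np.argsort(coarse_dist)[:coarse_k]
    # Stage 2: accurate matching in the second vector layer, candidates only.
    fine_dist = np.linalg.norm(high_dim_layer[candidates] - query_high, axis=1)
    best = candidates[np.argsort(fine_dist)[:final_k]]
    # Stage 3: fetch the matching records from the original data layer.
    return [(int(i), original_layer[int(i)]) for i in best]


results = hierarchical_search(high_dim_layer[42])
```

Only `coarse_k` high-dimensional distances are computed per query instead of one per stored record, which is the source of the efficiency gain described above.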
In some embodiments, the target retrieval feature may also be a high-dimensional vector feature, where the high-dimensional vector data set may be determined directly from the second vector layer of the database, and the target data set corresponding to the high-dimensional vector feature may be determined from the original data layer of the database according to the high-dimensional vector feature after the low-dimensional conversion. Alternatively, the target retrieval feature may be an original data feature, in which case the corresponding target data set may be determined directly from the original data layer of the database.
As shown in fig. 9, which is a schematic diagram of the functional modules of a data storage system provided in an embodiment of the present application, the embodiment of the present application further provides a data storage system that can implement the above data storage method. The data storage system includes:
An obtaining module 801, configured to obtain a target data set to be stored, and extract a data type of the target data set, where the target data set includes original data features;
a high-dimensional conversion module 802, configured to perform high-dimensional conversion on the target data set according to the data type, so as to obtain a high-dimensional vector data set, where the high-dimensional vector data set includes high-dimensional vector features;
a low-dimensional conversion module 803, configured to perform low-dimensional conversion on the high-dimensional vector data set according to the data type, so as to obtain a low-dimensional vector data set, where the low-dimensional vector data set includes low-dimensional vector features;
the storage module 804 is configured to store the low-dimensional vector data set, the high-dimensional vector data set, and the target data set in a first vector layer, a second vector layer, and an original data layer of the database, respectively, according to a preset index structure, so that when searching is performed according to the low-dimensional vector feature or the high-dimensional vector feature, the target data set is obtained, where a correlation exists among the high-dimensional vector feature, the low-dimensional vector feature, and the original data feature.
In some embodiments, the target data set may be obtained from user input or obtained from a database in which the target data set is pre-stored.
In some embodiments, the target data sets may be in various data forms such as text, audio, image or video, and each target data set may include one or more pieces of target data, where the data features of the target data are the original data features of the target data set.
Illustratively, a certain target data set is student information that includes the heights and weights of all students. Each student then corresponds to one piece of target data, and the height information and the weight information are the original data features of the target data set.
In some embodiments, the data type of the target data set may be extracted by an extraction operation. When the target data set is stored in a particular storage format, the data type can be determined from that format. For example, when a target data set is stored in a txt file, the data type of the target data set may be determined to be a text type.
In some embodiments, the data type of the target data set may also be determined from the transmission channel. If the data is transmitted via the HTTP or HTTPS protocol, the HTTP header typically includes a Content-Type field that indicates the data type of the request or response body. For example, when the Content-Type is "text/plain", the data type is determined to be a text type, and when the Content-Type is "image/jpeg", "image/png" or "image/gif", the data type is determined to be an image type.
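As a minimal sketch of the Content-Type based determination described above, the following function maps a header value to a coarse data type. The function name `data_type_from_content_type` and the fallback value "unknown" are assumptions for illustration.

```python
def data_type_from_content_type(content_type: str) -> str:
    """Map an HTTP Content-Type value to a coarse data type."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    main_type = media_type.split("/", 1)[0]
    known = {"text": "text", "image": "image", "audio": "audio", "video": "video"}
    # Media types outside the four forms named in the text fall back to "unknown".
    return known.get(main_type, "unknown")
```

For example, "text/plain; charset=utf-8" is classified as text, and "image/png" as image.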
In some embodiments, for different data types of a target data set, different high-dimensional transformation operations may be performed on the target data set, and a high-dimensional vector data set corresponding to the target data set may be generated, where the high-dimensional vector data set includes one or more high-dimensional vector data, and each high-dimensional vector data includes a high-dimensional vector feature.
In some embodiments, after the target data sets are converted into high-dimensional vector data sets through high-dimensional conversion, various machine learning algorithms can be applied more easily to classify, cluster or regress the high-dimensional vector data, so as to reflect the differences and similarities between the target data sets more accurately. Because the high-dimensional vector data set is associated with the target data set and the two are stored in association, the corresponding high-dimensional vector data set can be obtained during retrieval by searching the high-dimensional vector features, which narrows the retrieval range of the target data set and thereby enables rapid retrieval of the target data set.
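For text data, one possible form of high-dimensional conversion is shown below as a sketch. It uses whitespace splitting as a stand-in for word segmentation and the hashing trick as a stand-in for the high-dimensional vector mapping; neither choice is named by the embodiment, and the function name `text_to_high_dim` and the dimension 256 are assumptions.

```python
import hashlib

import numpy as np


def text_to_high_dim(text: str, dim: int = 256) -> np.ndarray:
    """Map text to a fixed-length vector by hashing each token to a slot."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        # md5 gives a hash that is stable across runs, unlike built-in hash().
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim
        vec[index] += 1.0
    norm = np.linalg.norm(vec)
    # Normalize so that dot products behave like cosine similarity.
    return vec / norm if norm > 0 else vec
```

Texts that share tokens share non-zero slots, so their vectors have a larger dot product; a trained embedding model would normally replace this step for image, audio or video data.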
In some embodiments, for different data types of the target data set, different low-dimensional conversion operations may be performed on the high-dimensional vector data set to generate a low-dimensional vector data set corresponding to the high-dimensional vector data set, where the low-dimensional vector data set includes one or more pieces of low-dimensional vector data, and each piece of low-dimensional vector data includes a low-dimensional vector feature.
In some embodiments, after the high-dimensional vector data set is converted into the low-dimensional vector data set through low-dimensional conversion, various machine learning algorithms can be applied more easily to classify, cluster or regress the low-dimensional vector data, so as to reflect the differences and similarities between the target data sets more accurately. Because the low-dimensional vector data set is associated with the target data set and the two are stored in association, the corresponding low-dimensional vector data set can be obtained during retrieval by searching the low-dimensional vector features, which narrows the retrieval range of the target data set and thereby enables rapid retrieval of the target data set.
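One common way to realize a linear low-dimensional conversion is principal component analysis. The sketch below computes it with a singular value decomposition; PCA is an assumed technique here, chosen only to make the conversion concrete, and the function name `pca_reduce` is hypothetical.

```python
import numpy as np


def pca_reduce(high_dim: np.ndarray, n_components: int) -> np.ndarray:
    """Project row vectors onto their top principal directions."""
    centered = high_dim - high_dim.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(1)
high = rng.normal(size=(50, 16))   # 50 high-dimensional vectors (toy data)
low = pca_reduce(high, 4)          # 50 low-dimensional vectors
```

The same row index identifies a vector in both arrays, which is one simple way to keep the low-dimensional and high-dimensional vector data associated.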
In some embodiments, the low-dimensional vector data set, the high-dimensional vector data set, and the target data set are stored at different data layers of the database, respectively, and illustratively, the low-dimensional vector data set, the high-dimensional vector data set, and the target data set may be stored at the first vector layer, the second vector layer, and the original data layer, respectively.
In some embodiments, a variety of index structures may be preset, and the relevant person may choose one or more index structures to store the data set according to particular needs.
In some embodiments, common index structures include the B-Tree index, the Hash index, the Full-Text index, the R-Tree index, and the like.
In some embodiments, the low-dimensional vector data set, the high-dimensional vector data set and the target data set (hereinafter simply referred to as "data sets") can be stored according to different index structures. The index structures corresponding to the data sets can differ, and each storage layer can include several index structures that store the data sets in different storage modes, so that at retrieval time the data sets can be retrieved with whichever index structure suits the requirement. Illustratively, the original data layer employs a B-Tree index structure, while the first vector layer and the second vector layer employ two index structures, a KD-Tree (K-Dimensional Tree) and LSH (Locality Sensitive Hashing), for storage of the data sets.
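To make the LSH index mentioned above concrete, the following is a minimal sketch of random-hyperplane hashing, one well-known LSH family for cosine similarity. The class name `RandomHyperplaneLSH`, the single hash table and the parameter values are assumptions; the embodiment does not specify which LSH scheme is used.

```python
from collections import defaultdict

import numpy as np


class RandomHyperplaneLSH:
    """Bucket vectors by the signs of their projections on random hyperplanes."""

    def __init__(self, dim: int, n_bits: int = 8, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_bits, dim))
        self.buckets = defaultdict(list)

    def _key(self, vec: np.ndarray) -> tuple:
        # One bit per hyperplane: which side of the plane the vector lies on.
        return tuple(((self.planes @ vec) >= 0).tolist())

    def add(self, item_id, vec: np.ndarray) -> None:
        self.buckets[self._key(vec)].append(item_id)

    def candidates(self, vec: np.ndarray) -> list:
        # Vectors pointing in similar directions tend to share a bucket.
        return list(self.buckets.get(self._key(vec), []))
```

Only the returned candidates need an exact distance computation. Practical deployments usually keep several such tables to reduce the chance of missing a near neighbour.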
In some embodiments, the low-dimensional vector data set, the high-dimensional vector data set and the target data set may be stored in different data layers of the database, respectively, so as to achieve efficient storage of data.
In some embodiments, more data storage layers may also be provided in the database to store other transformed forms of the target data set as actually needed to achieve more efficient data management. A metadata layer may be provided, for storing metadata related to the target data set, such as the source of the data, the time of creation, the extraction or dimension reduction method used, etc., which can be used for data management and auditing.
In some embodiments, because the low-dimensional vector dataset is obtained by low-dimensional conversion of the high-dimensional vector dataset and the high-dimensional vector dataset is obtained by high-dimensional conversion of the target dataset, there is an association between the low-dimensional vector dataset, the high-dimensional vector dataset, and the target dataset.
In some embodiments, since each data layer has its own index structure, interaction and association between the index structures of different levels are typically achieved through a mapping relationship, thereby associating the different data layers. In particular, an index entry in each upper-level index structure may contain a pointer to an index entry in the lower-level index structure, so that the lower-level index entry can be found quickly from the upper-level index entry. For example, an index entry of the original data layer may contain a pointer to the corresponding index entry of the second vector layer.
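The pointer-based association between layers can be sketched as linked index entries. The names `IndexEntry` and `resolve` are hypothetical. The direction of the pointers in this sketch follows the retrieval order of steps S702 to S704 (first vector layer, second vector layer, original data layer), which is an assumption; the pointers could equally run from the original data layer toward the vector layers, as in the example above.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class IndexEntry:
    key: str
    payload: Any
    # Pointer to the associated entry in the next layer, if any.
    lower: Optional["IndexEntry"] = None


def resolve(entry: IndexEntry) -> Any:
    """Follow the pointers until the last layer and return its payload."""
    while entry.lower is not None:
        entry = entry.lower
    return entry.payload


original = IndexEntry("student-7", {"height_cm": 172, "weight_kg": 63})
high = IndexEntry("student-7/high", [0.1] * 64, lower=original)
low = IndexEntry("student-7/low", [0.3, 0.9], lower=high)
```

Calling `resolve(low)` returns the original record without any search in the other layers, which is the quick lookup the mapping relationship is meant to provide.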
In some embodiments, for convenience of user use, the data storage system may provide a unified API interface and query language to support transparent access to the multi-level storage structure. In addition, the data storage system may also provide visualization tools and monitoring functions to help users understand the status of data storage and retrieval and adjust and optimize as needed.
It can be understood that by performing high-dimensional conversion and low-dimensional conversion on the target data set, multiple modal data of the target data set are respectively stored in different data layers of the database, so that the inefficient storage caused by storing different modal data in different databases is avoided, and meanwhile, the storage mode can improve the retrieval efficiency of the target data set.
The specific implementation of the data storage system is basically the same as the specific embodiment of the data storage method described above, and will not be described herein. On the premise of meeting the requirements of the embodiment of the application, the data storage system can be further provided with other functional modules so as to realize the data storage method in the embodiment.
The embodiment of the application also provides electronic equipment, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the data storage method when executing the computer program. The electronic equipment can be any intelligent terminal including a tablet personal computer, a vehicle-mounted computer and the like.
As shown in fig. 10, fig. 10 is a schematic hardware structure of an electronic device provided in an embodiment of the present application, where the electronic device includes:
the processor 901 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solutions provided by the embodiments of the present application;
the memory 902 may be implemented in the form of read-only memory (ROM), a static storage device, a dynamic storage device, or random access memory (RAM). The memory 902 may store an operating system and other application programs; when the technical solutions provided in the embodiments of the present application are implemented by software or firmware, the relevant program code is stored in the memory 902 and is invoked by the processor 901 to execute the data storage method of the embodiments of the present application;
an input/output interface 903 for inputting and outputting information;
the communication interface 904 is configured to implement communication interaction between the device and other devices, and may implement communication in a wired manner (e.g. USB, network cable, etc.), or in a wireless manner (e.g. mobile network, Wi-Fi, Bluetooth, etc.);
a bus 905 that transfers information between the various components of the device (e.g., the processor 901, the memory 902, the input/output interface 903, and the communication interface 904);
wherein the processor 901, the memory 902, the input/output interface 903 and the communication interface 904 are communicatively coupled to each other within the device via a bus 905.
The embodiment of the application also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, and the computer program realizes the data storage method when being executed by a processor.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiments described in the present application are intended to describe the technical solutions of the embodiments of the present application more clearly and do not constitute a limitation on the technical solutions provided by the embodiments of the present application. Those skilled in the art will know that, with the evolution of technology and the emergence of new application scenarios, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.
It will be appreciated by those skilled in the art that the technical solutions shown in the figures do not constitute limitations of the embodiments of the present application, and may include more or fewer steps than shown, or may combine certain steps, or different steps.
The above described apparatus embodiments are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the present application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in this application, "at least one (item)" and "a number" mean one or more, and "a plurality" means two or more. "And/or" describes the association relationship of associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent: only A is present, only B is present, or both A and B are present, where A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of the following" or a similar expression means any combination of these items, including any combination of a single item or plural items. For example, at least one of a, b or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c may each be single or plural.
In the several embodiments provided in this application, it should be understood that the disclosed systems and methods may be implemented in other ways. For example, the system embodiments described above are merely illustrative, e.g., the division of the above elements is merely a logical functional division, and there may be additional divisions in actual implementation, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or in whole or in part, may be embodied in the form of a software product stored in a storage medium, including multiple instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods of the various embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing a program.
Preferred embodiments of the present application are described above with reference to the accompanying drawings, and this description does not limit the scope of the claims of the embodiments of the present application. Any modifications, equivalent substitutions and improvements made by those skilled in the art without departing from the scope and spirit of the embodiments of the present application shall fall within the scope of the claims of the embodiments of the present application.

Claims (10)

1. A method of data storage, the method comprising:
acquiring a target data set to be stored, and extracting the data type of the target data set, wherein the target data set comprises original data characteristics;
performing high-dimensional conversion on the target data set according to the data type to obtain a high-dimensional vector data set, wherein the high-dimensional vector data set comprises high-dimensional vector features;
performing low-dimensional conversion on the high-dimensional vector data set according to the data type to obtain a low-dimensional vector data set, wherein the low-dimensional vector data set comprises low-dimensional vector features;
and respectively storing the low-dimensional vector data set, the high-dimensional vector data set and the target data set in a first vector layer, a second vector layer and an original data layer of a database according to a preset index structure, so that the target data set is obtained when searching according to the low-dimensional vector feature or the high-dimensional vector feature, wherein the high-dimensional vector feature, the low-dimensional vector feature and the original data feature are related to each other.
2. The data storage method according to claim 1, wherein said performing high-dimensional transformation on said target data set according to said data type to obtain a high-dimensional vector data set comprises:
when the data type is the first type, performing word segmentation operation on the target data set to obtain a word segmentation result, and performing high-dimensional vector mapping operation on the word segmentation result to obtain a high-dimensional vector data set;
and when the data type is the second type, inputting the target data set into a pre-trained conversion model, performing feature extraction operation on the target data set according to the conversion model to obtain high-dimensional vector features, and performing high-dimensional vectorization operation on the high-dimensional vector features to obtain a high-dimensional vector data set.
3. The data storage method according to claim 1, wherein said performing high-dimensional transformation on said target data set according to said data type to obtain a high-dimensional vector data set, further comprises:
when the target data set to be stored comprises at least two mutually related sub-target data sets, respectively carrying out high-dimensional conversion on each sub-target data set according to the data type of each sub-target data set to obtain a plurality of sub-high-dimensional vector data sets containing sub-high-dimensional vector features;
And vector fusion is carried out on the plurality of sub-high-dimensional vector data sets, so that a high-dimensional vector data set comprising a plurality of sub-high-dimensional vector features is obtained.
4. The data storage method according to claim 1, wherein said performing low-dimensional transformation on said high-dimensional vector data set according to said data type to obtain a low-dimensional vector data set comprises:
when the data type is linear, extracting a plurality of linear vector features from the high-dimensional vector data set, and performing low-dimensional vector mapping operation on the high-dimensional vector data set according to the linear vector features to obtain a low-dimensional vector data set;
and when the data type is a nonlinear type, calculating a similarity value between nonlinear vector features in the high-dimensional vector data set, and performing low-dimensional vector mapping operation on the nonlinear vector features according to the similarity value to obtain a low-dimensional vector data set.
5. The data storage method according to claim 1, wherein storing the low-dimensional vector data set, the high-dimensional vector data set, and the target data set in a first vector layer, a second vector layer, and an original data layer of a database, respectively, according to a predetermined index structure, comprises:
Respectively extracting index information of the low-dimensional vector data set, the high-dimensional vector data set and the target data set, and respectively determining index structures corresponding to the high-dimensional vector data set, the low-dimensional vector data set and the target data set according to the index information;
dividing the high-dimensional vector data set, the low-dimensional vector data set and the target data set into a plurality of characteristic data sets respectively, wherein the characteristic data sets comprise a plurality of characteristic data;
carrying out dimension variance calculation on the feature data of each feature data set to obtain a maximum variance value of the dimensions, calculating an average value or a mode value of the vector data, and obtaining a dimension partition value according to the average value or the mode value;
and acquiring storage partitions of each characteristic data set, and storing the high-dimensional vector data set, the low-dimensional vector data set and the target data set in a first vector layer, a second vector layer and an original data layer of a database according to the corresponding index structures according to the maximum variance value, the dimension partition value and the storage partitions.
6. The data storage method according to claim 1, further comprising, after the storing the low-dimensional vector data set, the high-dimensional vector data set, and the target data set in the first vector layer, the second vector layer, and the original data layer of the database, respectively, according to a predetermined index structure:
Recording the search times of each high-dimensional vector data set, each low-dimensional vector data set and each target data set;
and when the search times exceed a preset search threshold, adjusting the high-dimensional vector data set/the low-dimensional vector data set/the target data set to a priority node of a corresponding index structure, wherein the priority node is used for representing a priority search position of data.
7. A method of data retrieval, the method comprising:
acquiring target retrieval characteristics of a target data set to be retrieved;
if the target retrieval feature is a low-dimensional retrieval feature, determining a low-dimensional vector data set and a low-dimensional vector feature corresponding to the low-dimensional vector data set in a first vector layer of a database according to the low-dimensional retrieval feature;
determining a high-dimensional vector data set and high-dimensional vector features corresponding to the high-dimensional vector data set in a second vector layer of a database according to the low-dimensional vector features after high-dimensional conversion;
and determining a target data set corresponding to the high-dimensional vector feature from an original data layer of a database according to the high-dimensional vector feature after the low-dimensional conversion, wherein the retrieved target data set comprises the original data feature, the original data feature is matched with the target retrieval feature, and the high-dimensional vector feature, the low-dimensional vector feature and the original data feature are associated with each other.
8. A data storage system, the system comprising:
an acquisition module, configured to acquire a target data set to be stored and extract the data type of the target data set, wherein the target data set comprises original data features;
the high-dimensional conversion module is used for carrying out high-dimensional conversion on the target data set according to the data type to obtain a high-dimensional vector data set, wherein the high-dimensional vector data set comprises high-dimensional vector features;
the low-dimensional conversion module is used for carrying out low-dimensional conversion on the high-dimensional vector data set according to the data type to obtain a low-dimensional vector data set, wherein the low-dimensional vector data set comprises low-dimensional vector features;
the storage module is used for respectively storing the low-dimensional vector data set, the high-dimensional vector data set and the target data set in a first vector layer, a second vector layer and an original data layer of a database according to a preset index structure so as to obtain the target data set when searching according to the low-dimensional vector feature or the high-dimensional vector feature, wherein the high-dimensional vector feature, the low-dimensional vector feature and the original data feature are related to each other.
9. An electronic device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the data storage method of any one of claims 1 to 6 or the data retrieval method of claim 7.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the data storage method of any one of claims 1 to 6 or the data retrieval method of claim 7.
CN202311182707.4A 2023-09-12 2023-09-12 Data storage method, retrieval method, system, device and storage medium Pending CN117390013A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311182707.4A CN117390013A (en) 2023-09-12 2023-09-12 Data storage method, retrieval method, system, device and storage medium

Publications (1)

Publication Number Publication Date
CN117390013A true CN117390013A (en) 2024-01-12

Family

ID=89462070

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311182707.4A Pending CN117390013A (en) 2023-09-12 2023-09-12 Data storage method, retrieval method, system, device and storage medium

Country Status (1)

Country Link
CN (1) CN117390013A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101140583A (en) * 2007-10-09 2008-03-12 华为技术有限公司 Text searching method and device
CN104318046A (en) * 2014-08-18 2015-01-28 南京大学 System and method for incrementally converting high dimensional data into low dimensional data
CN108629345A (en) * 2017-03-17 2018-10-09 北京京东尚科信息技术有限公司 Dimensional images feature matching method and device
US20180349735A1 (en) * 2015-07-23 2018-12-06 Beijing Jingdong Shangke Information Technology Co Ltd. Method and Device for Comparing Similarities of High Dimensional Features of Images
CN115129949A (en) * 2022-06-30 2022-09-30 上海徐毓智能科技有限公司 Vector range retrieval method, device, equipment, medium and program product

Similar Documents

Publication Publication Date Title
CN111753060B (en) Information retrieval method, apparatus, device and computer readable storage medium
CN112131350B (en) Text label determining method, device, terminal and readable storage medium
US8498455B2 (en) Scalable face image retrieval
CN111125422A (en) Image classification method and device, electronic equipment and storage medium
EP3138051A1 (en) Learning multimedia semantics from large-scale unstructured data
WO2023108993A1 (en) Product recommendation method, apparatus and device based on deep clustering algorithm, and medium
JP2016540332A (en) Visual-semantic composite network and method for forming the network
CN114329029B (en) Object retrieval method, device, equipment and computer storage medium
CN112149410A (en) Semantic recognition method and device, computer equipment and storage medium
TW202001621A (en) Corpus generating method and apparatus, and human-machine interaction processing method and apparatus
CN114627282A (en) Target detection model establishing method, target detection model application method, target detection model establishing device, target detection model application device and target detection model establishing medium
CN114490923A (en) Training method, device and equipment for similar text matching model and storage medium
CN114329004A (en) Digital fingerprint generation method, digital fingerprint generation device, data push method, data push device and storage medium
CN116821307B (en) Content interaction method, device, electronic equipment and storage medium
CN114328800A (en) Text processing method and device, electronic equipment and computer readable storage medium
CN116578729B (en) Content search method, apparatus, electronic device, storage medium, and program product
WO2023168997A1 (en) Cross-modal retrieval method and related device
CN117435685A (en) Document retrieval method, document retrieval device, computer equipment, storage medium and product
CN112084338A (en) Automatic document classification method, system, computer equipment and storage medium
CN116775875A (en) Question corpus construction method and device, question answering method and device and storage medium
CN115017263B (en) Article recommendation method, article recommendation device, electronic device and storage medium
CN115145980B (en) Dialogue reply generation method and device, electronic equipment and storage medium
KR102590388B1 (en) Apparatus and method for video content recommendation
CN113449522A (en) Text fuzzy matching method and device
Bouhlel et al. Hypergraph learning with collaborative representation for image search reranking

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination