WO2023019948A1 - Retrieval method, management method, and apparatuses for multimodal information base, device, and medium - Google Patents

Info

Publication number
WO2023019948A1
WO2023019948A1 PCT/CN2022/082949 CN2022082949W WO2023019948A1 WO 2023019948 A1 WO2023019948 A1 WO 2023019948A1 CN 2022082949 W CN2022082949 W CN 2022082949W WO 2023019948 A1 WO2023019948 A1 WO 2023019948A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
modal
feature
multimodal
target
Prior art date
Application number
PCT/CN2022/082949
Other languages
French (fr)
Chinese (zh)
Inventor
魏翔
孙逸鹏
姚锟
韩钧宇
丁二锐
刘经拓
Original Assignee
北京百度网讯科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京百度网讯科技有限公司
Publication of WO2023019948A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/907 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/908 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Definitions

  • the present disclosure relates to the field of artificial intelligence technology, in particular to computer vision and deep learning technology, which can be applied to scenarios such as image recognition and image search, and specifically relates to a retrieval method, a management method, and apparatuses for a multimodal information base, an electronic device, a computer-readable storage medium, and a computer program product.
  • Artificial intelligence is a discipline that studies the use of computers to simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking, planning, etc.), both at the hardware level and at the software level.
  • Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision technology, speech recognition technology, natural language processing technology, machine learning/deep learning, big data processing technology, and knowledge graph technology.
  • the present disclosure provides a retrieval method, management method, device, electronic equipment, computer readable storage medium and computer program product for a multimodal information base.
  • a retrieval method for a multimodal information base, wherein the multimodal information base includes multiple pieces of target information, and each piece of target information includes first modality information and second modality information
  • the method includes: in response to receiving retrieval information including the first modality information, using a first multimodal feature extraction module to extract the first modal feature of the retrieval information from the first modality information of the retrieval information; based on the similarity between the first modal feature of the retrieval information and each of the first modal feature and the second modal feature of each of the multiple pieces of target information, selecting a first group of target information among the multiple pieces of target information, wherein the first modal feature of each piece of target information is extracted from the first modality information of the target information using the first multimodal feature extraction module, and the second modal feature of each piece of target information is extracted from the second modality information of the target information using the second multimodal feature extraction module; and generating a retrieval result based on the first group of target information.
  • a management method for a multimodal information base, comprising: in response to receiving storage information (that is, information to be added to the information base) including first modality information and second modality information, using the first multimodal feature extraction module to extract the first modal feature of the storage information from the first modality information of the storage information, and using the second multimodal feature extraction module to extract the second modal feature of the storage information from the second modality information of the storage information; based on the first modal feature and the second modal feature of the storage information, calculating the multimodal feature of the storage information; based on the first modal feature, the second modal feature, and the multimodal feature of the storage information, generating one or more retrieval objects corresponding to the storage information in the multimodal information base; and in response to receiving retrieval information, performing the retrieval method according to the present disclosure.
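The ingestion steps of the management method can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the text does not say how the multimodal feature is calculated, so the fusion rule here (the L2-normalized mean of the two normalized features) is an assumption, and the names `fuse_features`, `build_retrieval_objects`, and the record fields are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return list(vector) if norm == 0.0 else [x / norm for x in vector]


def fuse_features(first: Sequence[float], second: Sequence[float]) -> List[float]:
    """Assumed fusion rule: normalized mean of the two normalized features."""
    if len(first) != len(second):
        raise ValueError("features must have the same dimension")
    a, b = l2_normalize(first), l2_normalize(second)
    return l2_normalize([(x + y) / 2.0 for x, y in zip(a, b)])


def build_retrieval_objects(
    item_id: str, first: Sequence[float], second: Sequence[float]
) -> List[Dict[str, object]]:
    """Create one retrieval object per feature kind for a stored item."""
    features = {
        "first_modality": list(first),
        "second_modality": list(second),
        "multimodal": fuse_features(first, second),
    }
    return [
        {"item_id": item_id, "kind": kind, "vector": vector}
        for kind, vector in features.items()
    ]
```

Storing one object per feature kind is only one possible layout; it lets a query be matched against any of the three vectors independently.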
  • a retrieval device for a multimodal information base, wherein the multimodal information base includes multiple pieces of target information, and each piece of target information includes first modality information and second modality information
  • the device includes: a retrieval feature extraction module, configured to: in response to receiving retrieval information including the first modality information, use the first multimodal feature extraction module to extract the first modal feature of the retrieval information from the first modality information of the retrieval information; a target matching module, configured to: based on the similarity between the first modal feature of the retrieval information and each of the first modal feature and the second modal feature of each piece of target information in the multiple pieces of target information, select the first group of target information among the multiple pieces of target information, wherein the first modal feature of each piece of target information is extracted from the first modality information of the target information using the first multimodal feature extraction module, and the second modal feature of each piece of target information is extracted from the second modality information of the target information using the second multimodal feature extraction module; and the retrieval result
  • a management device for a multimodal information base, including: a storage information extraction module, configured to: in response to receiving storage information including first modality information and second modality information, use the first multimodal feature extraction module to extract the first modal feature of the storage information from the first modality information of the storage information, and use the second multimodal feature extraction module to extract the second modal feature of the storage information from the second modality information of the storage information; a multimodal information generation module, configured to: based on the first modal feature and the second modal feature of the storage information, calculate the multimodal feature of the storage information; a retrieval object generation module, configured to: based on the first modal feature, the second modal feature, and the multimodal feature of the storage information, generate one or more retrieval objects corresponding to the storage information; and the retrieval device for the multimodal information base as described in the present disclosure.
  • an electronic device including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the retrieval method and/or management method as described in the present disclosure.
  • a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to execute the retrieval method and/or management method as described in the present disclosure.
  • a computer program product including a computer program, wherein the computer program implements the retrieval method and/or the management method as described in the present disclosure when executed by a processor.
  • various modal information in the multi-modal information base can be retrieved, avoiding the problem of inconsistency between different modal information of the same target information in the multi-modal information base.
  • FIG. 1 shows a schematic diagram of an exemplary system in which various methods described herein may be implemented according to an embodiment of the present disclosure
  • FIG. 2 shows a flow chart of a retrieval method for a multimodal information base according to an embodiment of the present disclosure
  • FIG. 3 shows a flow chart of a retrieval method for a multimodal information base according to an embodiment of the present disclosure
  • FIG. 4 shows a flow chart of a retrieval method for a multimodal information base according to an embodiment of the present disclosure
  • FIG. 5 shows a flow chart of a management method for a multimodal information base according to an embodiment of the present disclosure
  • Fig. 6 shows a flow chart of a management method for a multimodal information base according to an embodiment of the present disclosure
  • FIG. 7 shows a flowchart of an example process of extracting single-modal image features of warehouse-in information from one or more pieces of subject information of warehouse-in information in the method of FIG. 6 according to an embodiment of the present disclosure
  • Fig. 8 shows a structural block diagram of a retrieval device for a multimodal information base according to an embodiment of the present disclosure
  • Fig. 9 shows a structural block diagram of a management device for a multimodal information base according to an embodiment of the present disclosure
  • FIG. 10 shows a structural block diagram of an exemplary electronic device that can be used to implement the embodiments of the present disclosure.
  • the use of the terms "first", "second", etc. to describe various elements is not intended to limit the positional, temporal, or importance relationship of these elements; such terms are only used to distinguish one element from another.
  • first element and the second element can refer to the same instance of the element, and in some cases, they can also refer to different instances based on the description of the context.
  • FIG. 1 shows a schematic diagram of an exemplary system 100 in which various methods and apparatus described herein may be implemented according to an embodiment of the present disclosure.
  • the system 100 includes one or more client devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120.
  • Client devices 101, 102, 103, 104, 105, and 106 may be configured to execute one or more applications.
  • the server 120 may run one or more services or software applications enabling execution of the retrieval method and/or management method for the multimodal information base as described in the present disclosure.
  • server 120 may also provide other services or software applications that may include non-virtualized environments and virtualized environments.
  • these services may be provided as web-based services or cloud services, such as under a software-as-a-service (SaaS) model, to users of client devices 101, 102, 103, 104, 105, and/or 106.
  • server 120 may include one or more components that implement the functions performed by server 120. These components may include software components, hardware components, or combinations thereof executable by one or more processors. Users operating client devices 101, 102, 103, 104, 105, and/or 106 may in turn utilize one or more client application programs to interact with server 120 to utilize the services provided by these components. It should be understood that various different system configurations are possible, which may differ from system 100. Accordingly, FIG. 1 is one example of a system for implementing the various methods described herein, and is not intended to be limiting.
  • a user may use a client device 101, 102, 103, 104, 105, and/or 106 to retrieve target information in a multimodal information base (for example, by uploading retrieval information), or to add target information to the multimodal information base (for example, by uploading storage information).
  • a client device may provide an interface that enables a user of the client device to interact with the client device. The client device can also output information to the user via the interface.
  • Although FIG. 1 depicts only six client devices, those skilled in the art will understand that the present disclosure can support any number of client devices.
  • Client devices 101, 102, 103, 104, 105, and/or 106 may include various types of computing devices, such as portable handheld devices, general-purpose computers (such as personal computers and laptops), workstation computers, wearable devices, gaming systems, thin clients, various messaging devices, sensors or other sensing devices, etc. These computer devices can run various types and versions of software applications and operating systems, such as MICROSOFT Windows, APPLE iOS, UNIX-like operating systems, Linux or Linux-like operating systems (such as GOOGLE Chrome OS); or include various mobile operating systems, such as MICROSOFT Windows Mobile OS, iOS, Windows Phone, Android.
  • Portable handheld devices may include cellular phones, smart phones, tablet computers, personal digital assistants (PDAs), and the like.
  • Wearable devices can include head-mounted displays and other devices. Gaming systems may include various handheld gaming devices, Internet-enabled gaming devices, and the like.
  • a client device is capable of executing a variety of different applications, such as various Internet-related applications, communication applications (e.g., email applications), and Short Message Service (SMS) applications, and may use a variety of communication protocols.
  • Network 110 can be any type of network known to those skilled in the art that can support data communications using any of a number of available protocols, including but not limited to TCP/IP, SNA, IPX, and the like.
  • the one or more networks 110 may be a local area network (LAN), an Ethernet-based network, a token ring, a wide area network (WAN), the Internet, a virtual network, a virtual private network (VPN), an intranet, an extranet, a public switched telephone network (PSTN), an infrared network, a wireless network (e.g., Bluetooth, WIFI), and/or any combination of these and/or other networks.
  • Server 120 may include one or more general purpose computers, dedicated server computers (e.g., PC (personal computer) servers, UNIX servers, midrange servers), blade servers, mainframe computers, server clusters, or any other suitable arrangement and/or combination.
  • Server 120 may include one or more virtual machines running virtual operating systems, or other computing architectures involving virtualization (e.g., one or more flexible pools of logical storage devices that may be virtualized to maintain the server's virtual storage devices).
  • server 120 may run one or more services or software applications that provide the functionality described below.
  • Computing units in server 120 may run one or more operating systems including any of the operating systems described above as well as any commercially available server operating systems.
  • Server 120 may also run any of a variety of additional server applications and/or middle-tier applications, including HTTP servers, FTP servers, CGI servers, JAVA servers, database servers, and the like.
  • server 120 may include one or more applications to analyze and consolidate data feeds and/or event updates received from users of client devices 101, 102, 103, 104, 105, and 106.
  • Server 120 may also include one or more applications to display data feeds and/or real-time events via one or more display devices of client devices 101, 102, 103, 104, 105, and 106.
  • in some embodiments, the server 120 may include one or more application programs, for example, application programs based on services such as image, video, voice, text, and digital signal processing, to handle task requests such as voice interaction, text classification, image recognition, or key point detection received from client devices 101, 102, 103, 104, 105, and 106.
  • the server can use training samples to train a neural network model according to the specific deep learning task, can test each sub-network in the super network module of the neural network model, and, according to the test results of each sub-network, can determine the structure and parameters of the neural network model for performing the deep learning task.
  • various data can be used as training sample data for deep learning tasks, such as image data, audio data, video data, or text data.
  • the server 120 can also automatically search for an optimal model structure through model search technology to perform the corresponding tasks.
  • the server 120 may be a server of a distributed system, or a server combined with blockchain.
  • the server 120 can also be a cloud server, or an intelligent cloud computing server or an intelligent cloud host with artificial intelligence technology.
  • A cloud server is a host product in the cloud computing service system that addresses the difficult management and weak business scalability of traditional physical host and virtual private server (VPS) services.
  • System 100 may also include one or more databases 130 .
  • these databases may be used to store data and other information.
  • databases 130 may be used to store information such as audio files and video files.
  • Data repository 130 may reside in various locations.
  • the data store used by server 120 may be local to server 120, or may be remote from server 120 and may communicate with server 120 via a network-based or dedicated connection.
  • Data repository 130 can be of different types.
  • the data store used by server 120 may be a database, such as a relational database.
  • One or more of these databases may store, update, and retrieve data in response to commands.
  • databases 130 may also be used by applications to store application data.
  • Databases used by applications can be different types of databases such as key-value stores, object stores or regular stores backed by a file system.
  • the system 100 of FIG. 1 may be configured and operated in various ways to enable application of the various methods and apparatuses described in accordance with this disclosure.
  • a traditional information base is retrieved based on either the text keywords or the image content of the target information. It is therefore desirable to provide a retrieval method based on multiple modalities of the target information (such as image information and text information), to avoid inconsistencies between the modalities of the same target information.
  • An embodiment of the present disclosure provides a retrieval method for a multimodal information base, wherein the multimodal information base includes multiple pieces of target information, and each piece of target information includes first modality information and second modality information.
  • the method includes: in response to receiving retrieval information including the first modality information, using a first multimodal feature extraction module to extract the first modal feature of the retrieval information from the first modality information of the retrieval information; based on the similarity between the first modal feature of the retrieval information and each of the first modal feature and the second modal feature of each piece of target information in the multiple pieces of target information, selecting a first group of target information among the multiple pieces of target information, wherein the first modal feature of each piece of target information is extracted from the first modality information of the target information using the first multimodal feature extraction module, and the second modal feature of each piece of target information is extracted from the second modality information of the target information using the second multimodal feature extraction module; and generating a retrieval result based on the first group of target information.
  • FIG. 2 shows a flowchart of a retrieval method 200 for a multimodal information base according to an embodiment of the present disclosure.
  • the multimodal information base includes multiple pieces of target information, and each piece of target information includes first modality information and second modality information.
  • the first multimodal feature extraction module is used to extract the first modality feature of the retrieval information from the first modality information of the retrieval information.
  • the retrieval information is a retrieval request sent by the client to the server.
  • For example, when a user wishes to search the Internet for a skirt seen offline, the user can enter "dress" on the client or take a picture of the skirt, and send a corresponding retrieval request to the server through the client.
  • the first modality information of the retrieval information may be text information or image information. When the first modality information is text information, a multimodal text extraction module (for example, a BERT Base network) is used to extract the multimodal text features of the retrieval information; when the first modality information is image information, a multimodal image extraction module (for example, a ViT Base network) is used to extract the multimodal image features of the retrieval information.
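The choice of extractor by query modality can be sketched as a simple dispatch. This is an illustration only: the two extractors are passed in as plain callables standing in for the text and image networks, and representing a text query as `str` and an image query as raw `bytes` is an assumption made for the example.

```python
from typing import Callable, List, Union

Feature = List[float]


def extract_query_feature(
    query: Union[str, bytes],
    text_extractor: Callable[[str], Feature],
    image_extractor: Callable[[bytes], Feature],
) -> Feature:
    """Route the query to the extractor that matches its modality."""
    if isinstance(query, str):
        # Text query: use the multimodal text extraction module.
        return text_extractor(query)
    if isinstance(query, (bytes, bytearray)):
        # Image query: use the multimodal image extraction module.
        return image_extractor(bytes(query))
    raise TypeError("query must be text (str) or image data (bytes)")
```

Because both extractors are multimodal, their outputs are expected to live in a space where text and image features can be compared with each other, which is what makes the later cross-modal matching possible.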
  • based on the similarity between the first modal feature of the retrieval information and each of the first modal feature and the second modal feature of each piece of target information in the multiple pieces of target information, a first group of target information among the multiple pieces of target information is selected, wherein the first modal feature of each piece of target information is extracted from the first modality information of the target information using the first multimodal feature extraction module, and the second modal feature of each piece of target information is extracted from the second modality information of the target information using the second multimodal feature extraction module.
  • selecting the first group of target information among the multiple pieces of target information includes: based on the similarity between the first modality feature of the retrieved information and the first modality feature of each piece of target information among the multiple pieces of target information, selecting A second set of target information in the multiple pieces of target information; and based on the similarity between the first modality feature of the retrieved information and the second modality feature of each target information in the second set of targets, from the second set of target information Select the first set of target information.
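The two-stage selection described above can be sketched as follows. This is a minimal sketch under stated assumptions: the `Target` record, the function names, and the fixed group sizes `recall_k` and `final_k` are hypothetical, and the first and second modal features are assumed to share one embedding space so that the query feature can be compared with either.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Target:
    target_id: str
    first_feature: Sequence[float]   # e.g. feature of the image information
    second_feature: Sequence[float]  # e.g. feature of the text information


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 0.0 if norm == 0.0 else dot / norm


def two_stage_select(
    query_feature: Sequence[float],
    targets: List[Target],
    recall_k: int,
    final_k: int,
) -> List[Target]:
    # Stage 1: rank all targets by first-modality similarity (second group).
    second_group = sorted(
        targets,
        key=lambda t: cosine(query_feature, t.first_feature),
        reverse=True,
    )[:recall_k]
    # Stage 2: re-rank the survivors by second-modality similarity (first group).
    return sorted(
        second_group,
        key=lambda t: cosine(query_feature, t.second_feature),
        reverse=True,
    )[:final_k]
```

A target whose second modality matches the query but whose first modality does not is dropped in stage 1, which is how the two stages together filter out entries whose modalities disagree.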
  • selecting the first group of target information among the multiple pieces of target information includes: based on the similarity between the first modal feature of the retrieval information and the first modal feature of each piece of target information, and the similarity between the second modal feature of the retrieval information and the second modal feature of each piece of target information, calculating a retrieval score for each piece of target information, and selecting the first group of target information from the multiple pieces of target information based on the retrieval scores.
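The score-based alternative can be illustrated as follows. The text does not state how the two similarities are combined into a retrieval score, so the weighted sum used here is an assumption, as are the names `retrieval_score`, `select_by_score`, and the `weight_first` parameter.

```python
from typing import Dict, List, Tuple


def retrieval_score(sim_first: float, sim_second: float, weight_first: float = 0.5) -> float:
    """Assumed scoring rule: convex combination of the two similarities."""
    if not 0.0 <= weight_first <= 1.0:
        raise ValueError("weight_first must lie in [0, 1]")
    return weight_first * sim_first + (1.0 - weight_first) * sim_second


def select_by_score(
    similarities: Dict[str, Tuple[float, float]],
    top_n: int,
    weight_first: float = 0.5,
) -> List[Tuple[str, float]]:
    """Score every target and keep the top_n highest-scoring ones."""
    scored = [
        (target_id, retrieval_score(s1, s2, weight_first))
        for target_id, (s1, s2) in similarities.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]
```

With equal weights, a target that is moderately similar in both modalities outranks one that is very similar in only one of them.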
  • the multiple pieces of target information are sorted by similarity, and a first predetermined number of pieces of target information with the highest similarity are selected as the second group of target information.
  • target information whose similarity between the first modal feature of the retrieved information and the first modal feature is greater than a first similarity threshold is selected from the multiple pieces of target information as the second group of target information.
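The two selection rules just described, a fixed count and a similarity threshold, can be contrasted in a short sketch. The function names and the mapping from target identifier to similarity are hypothetical.

```python
from typing import Dict, List


def select_top_k(similarity_by_id: Dict[str, float], k: int) -> List[str]:
    """Keep the k targets with the highest similarity, best first."""
    ranked = sorted(similarity_by_id, key=similarity_by_id.get, reverse=True)
    return ranked[:k]


def select_above_threshold(similarity_by_id: Dict[str, float], threshold: float) -> List[str]:
    """Keep every target whose similarity is strictly greater than the threshold."""
    kept = [tid for tid, sim in similarity_by_id.items() if sim > threshold]
    return sorted(kept, key=similarity_by_id.get, reverse=True)
```

The fixed-count rule bounds the cost of the next stage, while the threshold rule returns a variable number of candidates and may return none when nothing is similar enough.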
  • the second group of target information is sorted based on the similarity between the second modality feature of the retrieval information and the second modality feature of each piece of target information, and a second predetermined number of pieces of target information with the highest similarity in the second group of target information are selected as the first group of target information.
  • target information whose similarity between the second modal feature of the retrieved information and the second modal feature is greater than a second similarity threshold is selected from the second group of target information as the first set of target information.
  • all target information in the second group of target information may be selected as the first group of target information, and the target information is ranked only according to the similarity between the second modality feature of the retrieval information and the second modality feature of each piece of target information.
  • the similarity between the first modal feature of the retrieved information and the first modal feature of each piece of target information and/or the similarity between the second modal feature of the retrieved information and the second modal feature of each piece of target information is cosine similarity.
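For reference, cosine similarity between two feature vectors can be computed as in the sketch below. Returning 0.0 for a zero vector is a convention chosen for this example, not something the text specifies.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar to everything
    return dot / (norm_a * norm_b)
```

Because the measure depends only on direction, features that differ by a positive scale factor score as identical.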
  • the first modality information is any one of image information and text information
  • the second modality information is the other of image information and text information
  • a retrieval result is generated based on the first group of target information.
  • the similarity determines the ranking order of the target information in the retrieval results.
  • a retrieval score is calculated for each piece of target information, and, based on the retrieval scores, a retrieval result corresponding to the first group of target information is generated.
  • with the retrieval method for the multimodal information base as described in the embodiments of the present disclosure, even if the user inputs only single-modal retrieval information, retrieval can be performed based on multiple modalities of the target information, avoiding the problem of inconsistency between different modal information of the same target information in the multimodal information base (for example, the image information of a piece of target information not matching its text information).
  • the first modality information is image information
  • the retrieval method for a multimodal information base further includes, before selecting the first group of target information among the multiple pieces of target information: using a subject detection module to extract one or more pieces of subject information from the first modality information of the retrieval information; for each piece of subject information, using an image feature extraction module to extract the single-modal image features of the subject information from the subject information; and based on the similarity between the single-modal image features of the one or more pieces of subject information and the single-modal image features of each piece of target information in the multiple pieces of target information, selecting a third group of target information among the multiple pieces of target information, wherein the single-modal image feature of each piece of target information is extracted from the first modality information of the target information using the image feature extraction module, and wherein selecting the first group of target information among the multiple pieces of target information includes: for each piece of target information in the third group of target information, based on the similarity between the first modal feature of the retrieval information and the first
  • FIG. 3 shows a flowchart of a retrieval method 300 for a multimodal information base according to an embodiment of the present disclosure.
  • step S301 in response to receiving the retrieval information including the first modality information, use the first multimodal feature extraction module to extract the first modality feature of the retrieval information from the first modality information of the retrieval information, wherein , the first modality information is image information.
  • step S301 may be performed similarly to step S201 in FIG. 2 .
  • step S303 use the subject detection module to extract one or more pieces of subject information from the first modality information of the retrieved information.
  • a target detector (for example, YOLO-v3) is used to perform target detection on the first modality information (i.e., image information) of the retrieval information. The resulting detection frames are filtered to keep those with higher confidence and appropriate size and position (for example, by filtering out detection frames with lower confidence, smaller size, or positions closer to the picture boundary), and the information corresponding to the selected detection frames is extracted as the one or more pieces of subject information.
  • the detection frames with smaller size and closer to the picture boundary may be filtered out from the detection frames first, and then the two detection frames with the highest confidence among the remaining detection frames are selected as the filtered detection frames.
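The filtering order described here (drop small and near-border frames first, then keep the two most confident) can be sketched as follows. The text gives no numeric criteria, so the `min_area_ratio` and `border_margin` parameters and their default values are assumptions, and the `Box` record is hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float
    confidence: float


def filter_boxes(
    boxes: List[Box],
    image_width: float,
    image_height: float,
    min_area_ratio: float = 0.05,  # assumed: minimum share of the image area
    border_margin: float = 0.02,   # assumed: margin as a share of each side
    keep: int = 2,
) -> List[Box]:
    image_area = image_width * image_height
    margin_x = border_margin * image_width
    margin_y = border_margin * image_height
    candidates = []
    for box in boxes:
        area = max(0.0, box.x2 - box.x1) * max(0.0, box.y2 - box.y1)
        too_small = area < min_area_ratio * image_area
        near_border = (
            box.x1 < margin_x
            or box.y1 < margin_y
            or box.x2 > image_width - margin_x
            or box.y2 > image_height - margin_y
        )
        if not too_small and not near_border:
            candidates.append(box)
    # Among the remaining frames, keep the most confident ones.
    candidates.sort(key=lambda b: b.confidence, reverse=True)
    return candidates[:keep]
```

Applying the geometric filters before the confidence ranking means a highly confident but tiny or clipped detection cannot displace a well-framed subject.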
  • step S305 for each piece of subject information, use an image feature extraction module to extract the single-modal image features of the subject information from the subject information.
  • the image feature extraction module has the same structure as the first multimodal feature extraction module.
  • the first multimodal feature extraction module is trained first, and its trained parameters are used as the initialization parameters of the image feature extraction module, which is then trained (for example, fine-tuned by metric learning on ID-labeled image data).
  • using the trained parameters of the first multimodal feature extraction module as the initialization parameters of the image feature extraction module shortens the time needed to train the image feature extraction module.
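  • Based on the similarity between the single-modal image features of the one or more pieces of subject information and the single-modal image feature of each piece of target information, a third group of target information among the multiple pieces of target information is selected.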
  • Where the one or more pieces of subject information include a plurality of pieces of subject information, selecting the third group of target information among the multiple pieces of target information includes: for each piece of subject information, selecting, among the multiple pieces of target information, multiple pieces of target information corresponding to the subject information, based on the similarity between the single-modal image feature of the subject information and the single-modal image feature of each piece of target information; and taking the multiple pieces of target information selected for each piece of subject information as the third group of target information.
  • The multiple pieces of target information corresponding to each piece of subject information are aggregated and deduplicated to obtain the third group of target information.
  • Where the one or more pieces of subject information include one piece of subject information, selecting the third group of target information among the multiple pieces of target information includes: based on the similarity between the single-modal image feature of the subject information and the single-modal image feature of each piece of target information, selecting multiple pieces of target information corresponding to the subject information as the third group of target information.
  • The multiple pieces of target information are sorted based on the similarity between the single-modal image feature of the subject information and the single-modal image feature of each piece of target information, and a third predetermined number of pieces of target information with the highest similarity are selected from the multiple pieces of target information.
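  • The per-subject selection followed by aggregation and deduplication can be sketched as below. Cosine similarity is an assumed similarity measure (the disclosure only speaks of "similarity"), and the function names and the dictionary layout of the target features are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_third_group(subject_features, target_image_features, k):
    """For each subject, take the k most similar targets; merge and deduplicate.

    subject_features: list of single-modal image features, one per subject.
    target_image_features: dict mapping target ID -> single-modal image feature.
    """
    selected, seen = [], set()
    for subject in subject_features:
        ranked = sorted(target_image_features,
                        key=lambda tid: cosine(subject, target_image_features[tid]),
                        reverse=True)
        for tid in ranked[:k]:
            if tid not in seen:  # deduplicate across subjects
                seen.add(tid)
                selected.append(tid)
    return selected
```

  • With a single piece of subject information the same function reduces to a plain top-k selection, which corresponds to the "third predetermined number" case.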
  • In step S309, for each piece of target information in the third group of target information, the similarity score of the target information is calculated based on the similarity between the first modal feature of the retrieval information and the first modal feature of the target information, and on the similarity between the first modal feature of the retrieval information and the second modal feature of the target information.
  • The maximum of these two similarities is used as the similarity score of the target information.
  • The first group of target information is selected from the third group of target information based on the similarity score of each piece of target information in the third group of target information.
  • A second predetermined number of pieces of target information with the highest similarity scores are selected from the third group of target information.
  • The target information whose similarity score in the third group of target information is higher than a similarity threshold is selected as the first group of target information.
  • All target information in the third group of target information is selected as the first group of target information, and the information in the first group of target information is sorted based on the similarity scores.
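  • A minimal sketch of the scoring in step S309 and of the three selection variants (top-k, threshold, or keep all and sort) is given below. Cosine similarity, the function names, and the `top_k` and `threshold` parameters are assumptions introduced for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def similarity_score(query_first, target_first, target_second):
    """Score = max of the query's similarity to the target's two modal features."""
    return max(cosine(query_first, target_first),
               cosine(query_first, target_second))


def select_first_group(query_first, third_group, top_k=None, threshold=None):
    """third_group: dict target ID -> (first modal feature, second modal feature).

    Returns (ID, score) pairs sorted by descending score. With neither top_k
    nor threshold set, every target is kept and only sorted.
    """
    scored = [(tid, similarity_score(query_first, f1, f2))
              for tid, (f1, f2) in third_group.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    if threshold is not None:
        scored = [pair for pair in scored if pair[1] > threshold]
    if top_k is not None:
        scored = scored[:top_k]
    return scored
```

  • Taking the maximum lets a target match an image query either through its own image feature or through its text feature, whichever is closer.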
  • a search result is generated based on the first group of target information.
  • the retrieval result is generated based on the similarity score of the first group of target information.
  • steps S303-S307 are performed between steps S301 and S309.
  • steps S301-S309 may also be performed in other order, for example, steps S303-S307 are performed first, then step S301 is performed, and then step S309 is performed.
  • Single-modal image features are extracted from the subject information in the image information, and the third group of target information is preliminarily screened from the multiple pieces of target information based on the similarity between the single-modal image features of the subject information and the single-modal image feature of each piece of target information, which improves retrieval accuracy when the user inputs only image information.
  • The retrieval method for a multimodal information base as described in the present disclosure further includes, before generating the retrieval result based on the first group of target information: in response to receiving retrieval information including the first modality information and the second modality information, using the first multimodal feature extraction module to extract the first modal feature of the retrieval information and using the second multimodal feature extraction module to extract the second modal feature of the retrieval information; generating a multimodal feature of the retrieval information based on the first modal feature and the second modal feature of the retrieval information; and selecting a first group of target information among the multiple pieces of target information based on the similarity between the multimodal feature of the retrieval information and the multimodal feature of each piece of target information, wherein the multimodal feature of each piece of target information is generated based on the first modal feature and the second modal feature of the target information.
  • FIG. 4 shows a flowchart of a retrieval method 400 for a multimodal information base according to an embodiment of the present disclosure.
  • In step S401, it is judged whether the received retrieval information includes both the first modality information and the second modality information. If the judgment result is "No", the method proceeds to step S403; if the judgment result is "Yes", the method proceeds to step S407.
  • In step S403, in response to receiving retrieval information including the first modality information, the first multimodal feature extraction module is used to extract the first modal feature of the retrieval information from the first modality information of the retrieval information.
  • step S403 may be performed similarly to step S201.
  • In step S405, a first group of target information among the multiple pieces of target information is selected based on the similarity between the first modal feature of the retrieval information and each of the first modal feature and the second modal feature of each piece of target information in the multiple pieces of target information.
  • step S405 may be performed similarly to step S203.
  • The first multimodal feature extraction module is used to extract the first modal feature of the retrieval information, and the second multimodal feature extraction module is used to extract the second modal feature of the retrieval information.
  • a multi-modal feature of the retrieved information is generated.
  • Generating the multimodal feature of the retrieval information includes: for each of the first modal feature and the second modal feature of the retrieval information, multiplying the modal feature by the weight corresponding to that modal feature to obtain the product corresponding to that modal feature; and normalizing the sum of the products corresponding to the first modal feature and the second modal feature of the retrieval information to obtain the multimodal feature of the retrieval information.
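  • The weighted fusion described above can be sketched as follows. The disclosure does not state which normalization is used or how the weights are chosen; this sketch assumes L2 normalization, equal-length feature vectors, and placeholder weights `w1` and `w2`.

```python
import math


def fuse_features(first, second, w1=0.5, w2=0.5):
    """Weighted sum of two modal features, followed by L2 normalization.

    L2 normalization and the default weights are assumptions of this sketch.
    """
    summed = [w1 * a + w2 * b for a, b in zip(first, second)]
    norm = math.sqrt(sum(v * v for v in summed))
    # An all-zero sum cannot be normalized; return it unchanged in that case.
    return [v / norm for v in summed] if norm else summed
```

  • Because the fused vector has unit length under this assumption, the similarity between two fused features can be computed as a plain dot product.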
  • a first group of target information among the multiple pieces of target information is selected.
  • a search result is generated based on the first group of target information.
  • When the retrieval information includes the first modality information, step S413 may be performed similarly to step S205.
  • When the retrieval information includes the first modality information and the second modality information, the retrieval result may be generated based on the similarity between the multimodal feature of the retrieval information and the multimodal feature of each piece of target information in the first group of target information.
  • the first multimodal feature extraction module and the second multimodal feature extraction module are obtained by training based on a loss function, wherein the loss function is obtained by the first multimodal feature extraction module and the second multimodal feature extraction module.
  • The first multimodal feature extraction module and the second multimodal feature extraction module can be trained jointly, so as to shorten the distance between the modal features of samples whose modality information matches and to lengthen the distance between the modal features of samples whose modality information does not match.
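  • The text does not give the form of the loss function. One common loss with the stated behavior (pulling matched pairs together and pushing mismatched pairs apart) is a symmetric contrastive loss over a batch of paired features; the sketch below shows that swapped-in technique, not necessarily the loss used in the disclosure. The function names and the `temperature` value are assumptions.

```python
import math


def _normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)


def _logsumexp(row):
    m = max(row)
    return m + math.log(sum(math.exp(x - m) for x in row))


def contrastive_loss(first_feats, second_feats, temperature=0.07):
    """Symmetric contrastive loss; pair i of each list is assumed to match."""
    a = [_normalize(v) for v in first_feats]
    b = [_normalize(v) for v in second_feats]
    n = len(a)
    # logits[i][j]: scaled cosine similarity of first-modal i and second-modal j.
    logits = [[sum(x * y for x, y in zip(a[i], b[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Cross-entropy with the matching index as the label, in both directions.
    first_to_second = sum(_logsumexp(logits[i]) - logits[i][i]
                          for i in range(n)) / n
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    second_to_first = sum(_logsumexp(columns[j]) - columns[j][j]
                          for j in range(n)) / n
    return 0.5 * (first_to_second + second_to_first)
```

  • The loss is small when each feature is closest to its own partner and large when it is closer to another sample's partner, which is the behavior the joint training is described as encouraging.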
  • An embodiment of the present disclosure also provides a management method for a multimodal information base, the method including: in response to receiving storage information including first modality information and second modality information, using the first multimodal feature extraction module to extract the first modal feature of the storage information from the first modality information of the storage information, and using the second multimodal feature extraction module to extract the second modal feature of the storage information from the second modality information of the storage information; calculating the multimodal feature of the storage information based on the first modal feature and the second modal feature of the storage information; generating, based on the first modal feature, the second modal feature and the multimodal feature of the storage information, one or more retrieval objects corresponding to the storage information in the multimodal information base; and, in response to receiving retrieval information, executing the retrieval method as described in the present disclosure.
  • Fig. 5 shows a flowchart of a management method 500 for a multimodal information base according to an embodiment of the present disclosure.
  • In step S501, in response to receiving storage information including the first modality information and the second modality information, the first multimodal feature extraction module is used to extract the first modal feature of the storage information from the first modality information of the storage information, and the second multimodal feature extraction module is used to extract the second modal feature of the storage information from the second modality information of the storage information.
  • In step S503, the multimodal feature of the storage information is calculated based on the first modal feature and the second modal feature of the storage information.
  • Calculating the multimodal feature of the storage information includes: for each of the first modal feature and the second modal feature of the storage information, multiplying the modal feature by the weight corresponding to that modal feature to obtain the product corresponding to that modal feature; and normalizing the sum of the products corresponding to the first modal feature and the second modal feature of the storage information to obtain the multimodal feature of the storage information.
  • In step S505, one or more retrieval objects corresponding to the storage information are generated in the multimodal information base based on the first modal feature, the second modal feature and the multimodal feature of the storage information.
  • Each feature of the storage information is added to the corresponding index file of the multimodal information base. For example, the first modal feature is added to the index file corresponding to the first modal feature, and the second modal feature is added to the index file corresponding to the second modal feature, so that each feature can be searched independently.
  • a retrieval object corresponding to the storage information is created, wherein the retrieval object includes the corresponding feature, ID, relevant network link, etc. of the storage information.
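  • As a rough illustration of keeping one index per feature type, with a retrieval object per stored item in each index, consider the in-memory sketch below. The class and field names, the use of plain lists as "index files", and cosine similarity are all assumptions; a production system would more likely use an approximate nearest-neighbor index.

```python
import math
from dataclasses import dataclass


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class RetrievalObject:
    item_id: str   # ID of the storage information
    feature: list  # the feature stored in this particular index
    link: str      # relevant network link


class MultimodalInfoBase:
    """One hypothetical in-memory index per feature type."""

    def __init__(self):
        self.indexes = {"first": [], "second": [], "multimodal": []}

    def add(self, item_id, link, first, second, multimodal):
        """Create one retrieval object per index for the stored item."""
        for name, feature in (("first", first), ("second", second),
                              ("multimodal", multimodal)):
            self.indexes[name].append(RetrievalObject(item_id, feature, link))

    def search(self, index_name, query, k):
        """Return the IDs of the k objects in one index most similar to query."""
        ranked = sorted(self.indexes[index_name],
                        key=lambda obj: cosine(query, obj.feature),
                        reverse=True)
        return [obj.item_id for obj in ranked[:k]]
```

  • Keeping the indexes separate is what allows an image-only query to be matched against image features and text features independently.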
  • the target information corresponding to the retrieval information is retrieved in the multimodal information base.
  • The retrieval method as described in the present disclosure is executed to retrieve the target information corresponding to the retrieval information in the multimodal information base.
  • The first modality information is any one of image information and text information, and the second modality information is the other of image information and text information. The management method for the multimodal information base further includes, before generating the one or more retrieval objects corresponding to the storage information in the multimodal information base: using the subject detection module to extract one or more pieces of subject information of the storage information from the image information of the storage information; and using the first multimodal feature extraction module, the second multimodal feature extraction module and the image feature extraction module to extract the single-modal image feature of the storage information from the one or more pieces of subject information of the storage information. Generating the one or more retrieval objects corresponding to the storage information in the multimodal information base then includes: generating the one or more retrieval objects based on the first modal feature, the second modal feature, the multimodal feature and the single-modal image feature of the storage information.
  • Fig. 6 shows a flowchart of a management method 600 for a multimodal information base according to an embodiment of the present disclosure.
  • In step S601, in response to receiving storage information including the first modality information and the second modality information, the first multimodal feature extraction module is used to extract the first modal feature of the storage information from the first modality information of the storage information, and the second multimodal feature extraction module is used to extract the second modal feature of the storage information from the second modality information of the storage information.
  • step S601 may be performed similarly to step S501.
  • In step S603, the multimodal feature of the storage information is calculated based on the first modal feature and the second modal feature of the storage information. According to some embodiments, step S603 may be performed similarly to step S503.
  • In step S605, the subject detection module is used to extract one or more pieces of subject information of the storage information from the image information of the storage information.
  • The one or more pieces of subject information of the storage information can be extracted from the image information of the storage information using the subject detection module, in a manner similar to that described with reference to step S303 for extracting subject information from the first modality information of the retrieval information.
  • In step S607, the first multimodal feature extraction module, the second multimodal feature extraction module and the image feature extraction module are used to extract the single-modal image feature of the storage information from the one or more pieces of subject information of the storage information.
  • In step S609, one or more retrieval objects corresponding to the storage information are generated in the multimodal information base based on the first modal feature, the second modal feature, the multimodal feature and the single-modal image feature of the storage information.
  • Each feature of the storage information is added to the corresponding index file of the multimodal information base, and a retrieval object corresponding to the storage information is created in each index file of the multimodal information base.
  • the target information corresponding to the retrieval information is retrieved in the multimodal information base.
  • The retrieval method as described in the present disclosure is executed to retrieve the target information corresponding to the retrieval information in the multimodal information base.
  • steps S603-S607 are executed between steps S601 and S609.
  • Steps S601-S609 may also be executed in another order; for example, steps S603-S607 are executed first, then step S601 is executed, and then step S609 is executed.
  • The first multimodal feature extraction module is any one of a multimodal image extraction module and a multimodal text extraction module, and the second multimodal feature extraction module is the other of the multimodal image extraction module and the multimodal text extraction module.
  • Extracting the single-modal image feature of the storage information from the one or more pieces of subject information of the storage information includes: for each of the image information and the one or more pieces of subject information of the storage information, using the multimodal image extraction module to extract the multimodal image feature of that information; using the multimodal text extraction module to extract the multimodal text feature of the storage information from the text information of the storage information; for each of the image information and the one or more pieces of subject information of the storage information, calculating the similarity between the multimodal image feature of that information and the multimodal text feature of the storage information as the similarity score of that information; selecting, from the image information and the one or more pieces of subject information of the storage information, the information with the largest similarity score; and using the image feature extraction module to extract the single-modal image feature of the storage information from the information with the largest similarity score.
  • FIG. 7 shows a flowchart of an example process of extracting the single-modal image feature of the storage information from the one or more pieces of subject information of the storage information (step S607) in the method of FIG. 6 according to an embodiment of the present disclosure.
  • In step S701, for each of the image information and the one or more pieces of subject information of the storage information, the multimodal image extraction module is used to extract the multimodal image feature of that information.
  • The multimodal text extraction module is used to extract the multimodal text feature of the storage information from the text information of the storage information.
  • For each of the image information and the one or more pieces of subject information of the storage information, the similarity between the multimodal image feature of that information and the multimodal text feature of the storage information is calculated as the similarity score of that information.
  • In step S707, the information with the largest similarity score is selected from the image information and the one or more pieces of subject information of the storage information.
  • In step S709, the image feature extraction module is used to extract the single-modal image feature of the storage information from the information with the largest similarity score.
  • The information with the largest similarity score is stored in the multimodal information base as the subject information of the storage information.
  • Selecting the information closest to the text information of the storage information as the source of the single-modal image feature of the storage information ensures the accuracy of the extracted single-modal image feature and its consistency between image and text.
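  • The selection in steps S701-S707 amounts to an arg-max over candidate regions, as in the sketch below. The candidates are the whole image plus each detected subject; the feature vectors are assumed to have already been produced by the multimodal image and text extraction modules, and cosine similarity and all names are assumptions.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def pick_region_closest_to_text(candidate_image_features, text_feature):
    """Return the name of the candidate whose image feature best matches the text.

    candidate_image_features: dict mapping a candidate name (for example
    "whole_image" or "subject_0") to its multimodal image feature.
    """
    return max(candidate_image_features,
               key=lambda name: cosine(candidate_image_features[name],
                                       text_feature))
```

  • The selected region would then be passed to the image feature extraction module to obtain the single-modal image feature (step S709), which is not shown here.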
  • Fig. 8 shows a structural block diagram of a retrieval device 800 for a multimodal information base according to an embodiment of the present disclosure.
  • The retrieval device 800 includes: a retrieval feature extraction module 801 configured to, in response to receiving retrieval information including the first modality information, use the first multimodal feature extraction module to extract the first modal feature of the retrieval information from the first modality information of the retrieval information; a target matching module 802 configured to select a first group of target information among the multiple pieces of target information based on the similarity between the first modal feature of the retrieval information and each of the first modal feature and the second modal feature of each piece of target information, wherein the first modal feature of each piece of target information is extracted from the first modality information of the target information using the first multimodal feature extraction module, and the second modal feature of each piece of target information is extracted from the second modality information of the target information using the second multimodal feature extraction module; and a retrieval result generation module 803 configured to generate a retrieval result based on the first group of target information.
  • the multimodal information base includes a plurality of target information including first modality information and second modality information.
  • The target matching module 802 includes: a second target information selection module configured to select a second group of target information among the multiple pieces of target information based on the similarity between the first modal feature of the retrieval information and the first modal feature of each piece of target information in the multiple pieces of target information; and a first target information selection module configured to select the first group of target information from the second group of target information based on the similarity between the first modal feature of the retrieval information and the second modal feature of each piece of target information in the second group of target information.
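  • The two-stage selection performed by these sub-modules (a second group chosen by first-modal similarity, then a first group chosen from it by second-modal similarity) can be sketched as follows. Cosine similarity, the group sizes `k2` and `k1`, and the function names are assumptions.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def two_stage_select(query_first, targets, k2, k1):
    """targets: dict target ID -> (first modal feature, second modal feature)."""
    # Stage 1: second group, ranked by similarity to each target's first modal feature.
    second_group = sorted(targets,
                          key=lambda tid: cosine(query_first, targets[tid][0]),
                          reverse=True)[:k2]
    # Stage 2: first group, ranked within the second group by second modal feature.
    first_group = sorted(second_group,
                         key=lambda tid: cosine(query_first, targets[tid][1]),
                         reverse=True)[:k1]
    return first_group
```

  • A target that matches only through its second modal feature is dropped in stage 1, so this variant requires agreement in both modalities.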
  • The first modality information is image information, and the retrieval device 800 further includes a subject feature extraction module, including: a subject detection module configured to extract one or more pieces of subject information from the first modality information of the retrieval information; a single-modal image feature extraction module configured to, for each piece of subject information, use the image feature extraction module to extract the single-modal image feature of the subject information from the subject information; and a third target information selection module configured to select a third group of target information among the multiple pieces of target information based on the similarity between the single-modal image features of the one or more pieces of subject information and the single-modal image feature of each piece of target information, wherein the single-modal image feature of each piece of target information is extracted from the first modality information of the target information using the image feature extraction module. The target matching module 802 includes: a similarity calculation module configured to, for each piece of target information in the third group of target information, calculate the similarity score of the target information based on the similarity between the first modal feature of the retrieval information and the first modal feature of the target information and the similarity between the first modal feature of the retrieval information and the second modal feature of the target information.
  • The one or more pieces of subject information include multiple pieces of subject information, and the first target information selection module includes: a subject information matching module configured to, for each piece of subject information, select, among the multiple pieces of target information, multiple pieces of target information corresponding to the subject information based on the similarity between the single-modal image feature of the subject information and the single-modal image feature of each piece of target information, and to take the multiple pieces of target information selected for each piece of subject information as the third group of target information.
  • The retrieval device 800 further includes a multimodal feature retrieval module, including: a multimodal sub-feature extraction module configured to, in response to receiving retrieval information including the first modality information and the second modality information, use the first multimodal feature extraction module to extract the first modal feature of the retrieval information and use the second multimodal feature extraction module to extract the second modal feature of the retrieval information; a multimodal feature generation module configured to generate the multimodal feature of the retrieval information based on the first modal feature and the second modal feature of the retrieval information; and a first target information selection module configured to select the first group of target information among the multiple pieces of target information based on the similarity between the multimodal feature of the retrieval information and the multimodal feature of each piece of target information, wherein the multimodal feature of each piece of target information is generated based on the first modal feature and the second modal feature of the target information.
  • The multimodal feature generation module includes: a product calculation module configured to, for each of the first modal feature and the second modal feature of the retrieval information, multiply the modal feature by the weight corresponding to that modal feature to obtain the product corresponding to that modal feature; and a normalization module configured to normalize the sum of the products corresponding to the first modal feature and the second modal feature of the retrieval information to obtain the multimodal feature of the retrieval information.
  • the first multimodal feature extraction module and the second multimodal feature extraction module are obtained by training based on a loss function, wherein the loss function is obtained by the first multimodal feature extraction module and the second multimodal feature extraction module.
  • the first modality information is any one of image information and text information
  • the second modality information is the other of image information and text information
  • Fig. 9 shows a structural block diagram of a management device 900 for a multimodal information base according to an embodiment of the present disclosure.
  • The management device 900 includes: a storage information extraction module 901 configured to, in response to receiving storage information including the first modality information and the second modality information, use the first multimodal feature extraction module to extract the first modal feature of the storage information from the first modality information of the storage information, and use the second multimodal feature extraction module to extract the second modal feature of the storage information from the second modality information of the storage information; a multimodal information generation module 902 configured to calculate the multimodal feature of the storage information based on the first modal feature and the second modal feature of the storage information; a retrieval object generation module 903 configured to generate one or more retrieval objects corresponding to the storage information in the multimodal information base; and the retrieval device 800 for the multimodal information base as described in the present disclosure.
  • The first modality information is any one of image information and text information, the second modality information is the other of image information and text information, and the management device 900 further includes: a storage subject detection module configured to use the subject detection module to extract one or more pieces of subject information of the storage information from the image information of the storage information; and a storage feature extraction module configured to use the first multimodal feature extraction module, the second multimodal feature extraction module and the image feature extraction module to extract the single-modal image feature of the storage information from the one or more pieces of subject information of the storage information. The retrieval object generation module 903 includes: a retrieval object generation sub-module configured to generate one or more retrieval objects corresponding to the storage information in the multimodal information base based on the first modal feature, the second modal feature, the multimodal feature and the single-modal image feature of the storage information.
  • The first multimodal feature extraction module is any one of a multimodal image extraction module and a multimodal text extraction module, and the second multimodal feature extraction module is the other of the multimodal image extraction module and the multimodal text extraction module.
  • The storage feature extraction module includes: a storage image extraction module configured to, for each of the image information and the one or more pieces of subject information of the storage information, use the multimodal image extraction module to extract the multimodal image feature of that information; a storage text extraction module configured to use the multimodal text extraction module to extract the multimodal text feature of the storage information from the text information of the storage information; and a storage subject selection module configured to, for each of the image information and the one or more pieces of subject information of the storage information, calculate the similarity between the multimodal image feature of that information and the multimodal text feature of the storage information as the similarity score of that information, and to select the information with the largest similarity score from the image information and the one or more pieces of subject information of the storage information.
  • The multimodal information generation module 902 includes: a storage product calculation module configured to, for each of the first modal feature and the second modal feature of the storage information, multiply the modal feature by the weight corresponding to that modal feature to obtain the product corresponding to that modal feature; and the sum of the products corresponding to the first modal feature and the second modal feature of the storage information is normalized to obtain the multimodal feature of the storage information.
  • the acquisition, storage and application of the user's personal information involved are in compliance with relevant laws and regulations, and do not violate public order and good customs.
  • An electronic device, a readable storage medium, and a computer program product are also provided.
  • The present disclosure provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method as described in the present disclosure.
  • the present disclosure provides a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to execute the method as described in the present disclosure.
  • the present disclosure provides a computer program product comprising a computer program, wherein the computer program implements the method as described in the present disclosure when executed by a processor.
  • Electronic devices are intended to represent various forms of digital electronic computing equipment, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. Electronic devices may also represent various forms of mobile devices, such as personal digital processing devices, cellular telephones, smart phones, wearable devices, and other similar computing devices.
  • the components shown herein, their connections and relationships, and their functions, are by way of example only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
  • the device 1000 includes a computing unit 1001 that can perform various appropriate actions and processing according to a computer program stored in a read-only memory (ROM) 1002 or loaded from a storage unit 1008 into a random-access memory (RAM) 1003. The RAM 1003 can also store various programs and data necessary for the operation of the device 1000.
  • the computing unit 1001, ROM 1002, and RAM 1003 are connected to each other through a bus 1004.
  • An input/output (I/O) interface 1005 is also connected to the bus 1004.
  • the input unit 1006 can be any type of device capable of inputting information to the device 1000; the input unit 1006 can receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device, and can include, but is not limited to, a mouse, keyboard, touch screen, trackpad, trackball, joystick, microphone, and/or remote control.
  • the output unit 1007 may be any type of device capable of presenting information, and may include, but is not limited to, a display, a speaker, a video/audio output terminal, a vibrator, and/or a printer.
  • the storage unit 1008 may include, but is not limited to, a magnetic disk and an optical disk.
  • the communication unit 1009 allows the device 1000 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks, and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication transceiver, and/or a chipset, such as a Bluetooth™ device, an 802.11 device, a WiFi device, a WiMax device, a cellular communication device, and/or the like.
  • the computing unit 1001 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 1001 include, but are not limited to, central processing units (CPUs), graphics processing units (GPUs), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, digital signal processors (DSPs), and any suitable processor, controller, microcontroller, etc.
  • the computing unit 1001 executes the various methods and processes described above, such as methods 200, 300, 400, 500, and/or 600. For example, in some embodiments, methods 200, 300, 400, 500, and/or 600 may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1008.
  • part or all of the computer program may be loaded and/or installed on the device 1000 via the ROM 1002 and/or the communication unit 1009.
  • When the computer program is loaded into the RAM 1003 and executed by the computing unit 1001, one or more steps of methods 200, 300, 400, 500, and/or 600 described above may be performed.
  • the computing unit 1001 may be configured to execute methods 200, 300, 400, 500, and/or 600 in any other suitable manner (for example, by means of firmware).
  • Various implementations of the systems and techniques described above can be implemented in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof.
  • the programmable processor may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
  • Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing device, so that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented.
  • the program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • a machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • a machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing.
  • machine-readable storage media would include one or more wire-based electrical connections, portable computer discs, hard drives, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, compact disk read only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
  • the systems and techniques described herein can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which the user can provide input to the computer.
  • Other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form (including acoustic input, speech input, or tactile input).
  • the systems and techniques described herein can be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user computer having a graphical user interface or web browser through which a user can interact with embodiments of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components.
  • the components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include: a Local Area Network (LAN), a Wide Area Network (WAN), and the Internet.
  • a computer system may include clients and servers.
  • Clients and servers are generally remote from each other and typically interact through a communication network.
  • the relationship of client and server arises by computer programs running on the respective computers and having a client-server relationship to each other.
  • the server can be a cloud server, a server of a distributed system, or a server combined with a blockchain.
  • steps may be reordered, added or deleted using the various forms of flow shown above.
  • each step described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution disclosed in the present disclosure can be achieved; no limitation is imposed herein.
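
The storage subject selection module described earlier in this list can be illustrated with a minimal sketch. It assumes that the multimodal image features and the multimodal text feature are plain vectors in a shared embedding space and that "similarity" means cosine similarity; neither assumption is stated in the text. The names `cosine_similarity`, `select_best_candidate`, and the candidate labels are hypothetical and chosen only for illustration.

```python
import math
from typing import Dict, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity (an assumed metric; the text only says 'similarity')."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_best_candidate(image_features: Dict[str, Vector],
                          text_feature: Vector) -> str:
    """Return the candidate whose image feature best matches the text feature.

    `image_features` maps a hypothetical label (the whole image or one
    detected subject) to its multimodal image feature.
    """
    if not image_features:
        raise ValueError("at least one candidate is required")
    # Similarity score of each candidate against the text feature.
    scores = {name: cosine_similarity(feature, text_feature)
              for name, feature in image_features.items()}
    # Keep the candidate with the largest similarity score.
    return max(scores, key=scores.get)


# Toy 2-D features, invented for illustration only.
candidates = {
    "full_image": [1.0, 0.0],
    "subject_1": [0.6, 0.8],
    "subject_2": [0.0, 1.0],
}
best = select_best_candidate(candidates, text_feature=[0.0, 1.0])
```

With these toy vectors, `best` is `"subject_2"`, the candidate whose image feature points in the same direction as the text feature.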

Landscapes

  • Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure relates to the technical field of artificial intelligence, in particular relates to the technical fields of computer vision and deep learning, and provides a retrieval method and management method for a multimodal information base, which can be applied to scenarios such as image recognition and image search. The implementation solution is: in response to receiving retrieval information which comprises first modal information, using a first multimodal feature extraction module to extract a first modal feature of the retrieval information from the first modal information of the retrieval information; selecting a first group of target information among a plurality of pieces of target information on the basis of the similarity between the first modal feature of the retrieval information and each of a first modal feature and second modal feature of each piece of target information among the plurality of pieces of target information; and generating a retrieval result on the basis of the first group of target information.

Description

Retrieval method, management method, apparatus, device, and medium for multimodal information base
Cross-Reference to Related Applications
This application claims priority to Chinese patent application 202110955328.9, filed on August 19, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of artificial intelligence technology, in particular to computer vision and deep learning technology, which can be applied to scenarios such as image recognition and image search, and specifically relates to a retrieval method, a management method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product for a multimodal information base.
Background
Artificial intelligence is a discipline that studies how to make computers simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it involves both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, and knowledge graph technologies.
With a traditional retrieval information base, it is difficult to solve the problem that the image and the text of a piece of target information in the information base do not match.
The approaches described in this section are not necessarily approaches that have been previously conceived or employed. Unless otherwise indicated, it should not be assumed that any approach described in this section qualifies as prior art merely by virtue of its inclusion in this section. Similarly, unless otherwise indicated, the issues mentioned in this section should not be considered to have been recognized in any prior art.
Summary
The present disclosure provides a retrieval method, a management method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product for a multimodal information base.
According to an aspect of the present disclosure, a retrieval method for a multimodal information base is provided, wherein the multimodal information base includes multiple pieces of target information, and each piece of target information includes first modal information and second modal information. The method includes: in response to receiving retrieval information including first modal information, using a first multimodal feature extraction module to extract a first modal feature of the retrieval information from the first modal information of the retrieval information; selecting a first group of target information from the multiple pieces of target information based on the similarity between the first modal feature of the retrieval information and each of the first modal feature and the second modal feature of each piece of target information, wherein the first modal feature of each piece of target information is extracted from the first modal information of that target information using the first multimodal feature extraction module, and the second modal feature of each piece of target information is extracted from the second modal information of that target information using a second multimodal feature extraction module; and generating a retrieval result based on the first group of target information.
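
The selection step of the retrieval method can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the disclosure: features are assumed to be already extracted vectors in a shared embedding space, similarity is assumed to be cosine similarity, the two per-target similarities are combined by taking their maximum, and the first group is the `top_k` highest-scoring targets. `TargetInfo`, `select_first_group`, and `top_k` are hypothetical names.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity (an assumed metric; the text only says 'similarity')."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class TargetInfo:
    """One piece of target information with its two precomputed features."""
    target_id: str
    first_modal_feature: Vector   # e.g. extracted from the image
    second_modal_feature: Vector  # e.g. extracted from the text


def select_first_group(query_feature: Vector,
                       targets: List[TargetInfo],
                       top_k: int = 2) -> List[str]:
    """Rank targets by similarity to the query's first modal feature."""
    scored = []
    for target in targets:
        # The query is compared with BOTH features of the target.
        # Taking the maximum is an assumption made for this sketch.
        score = max(
            cosine_similarity(query_feature, target.first_modal_feature),
            cosine_similarity(query_feature, target.second_modal_feature),
        )
        scored.append((score, target.target_id))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [target_id for _, target_id in scored[:top_k]]


# Toy data, invented for illustration only.
library = [
    TargetInfo("a", [1.0, 0.0], [0.0, 1.0]),
    TargetInfo("b", [0.0, 1.0], [0.0, 1.0]),
    TargetInfo("c", [0.0, 1.0], [0.6, 0.8]),
]
first_group = select_first_group([1.0, 0.0], library, top_k=2)
```

In this toy library, target "c" is reached only through its second modal feature, which shows why comparing the query against both features can surface items whose first modal information alone would not match.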
According to another aspect of the present disclosure, a management method for a multimodal information base is provided. The method includes: in response to receiving storage information (that is, information to be added to the information base) including first modal information and second modal information, using a first multimodal feature extraction module to extract a first modal feature of the storage information from the first modal information of the storage information, and using a second multimodal feature extraction module to extract a second modal feature of the storage information from the second modal information of the storage information; calculating a multimodal feature of the storage information based on the first modal feature and the second modal feature of the storage information; generating, based on the first modal feature, the second modal feature, and the multimodal feature of the storage information, one or more retrieval objects corresponding to the storage information in the multimodal information base; and in response to receiving retrieval information, performing the retrieval method as described in the present disclosure.
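
The feature fusion and retrieval object generation steps of the management method can be sketched as follows. The weighted sum followed by normalization follows the description of the multimodal information generation module elsewhere in this document; everything else is an assumption made for illustration: equal default weights, L2 normalization, features of equal dimension, and one retrieval object per feature. `fuse_features`, `build_retrieval_objects`, and the dictionary keys are hypothetical names.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def l2_normalize(vector: Vector) -> List[float]:
    """Scale a vector to unit length (the normalization type is assumed)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def fuse_features(first: Vector, second: Vector,
                  w_first: float = 0.5, w_second: float = 0.5) -> List[float]:
    """Multiply each modal feature by its weight, sum, then normalize."""
    if len(first) != len(second):
        raise ValueError("features must have the same dimension")
    weighted_sum = [w_first * a + w_second * b for a, b in zip(first, second)]
    return l2_normalize(weighted_sum)


def build_retrieval_objects(info_id: str, first: Vector,
                            second: Vector) -> List[Dict[str, object]]:
    """Create one retrieval object per feature (an assumed layout)."""
    return [
        {"info_id": info_id, "kind": "first_modal", "feature": list(first)},
        {"info_id": info_id, "kind": "second_modal", "feature": list(second)},
        {"info_id": info_id, "kind": "multimodal",
         "feature": fuse_features(first, second)},
    ]


# Toy features, invented for illustration only.
fused = fuse_features([1.0, 0.0], [0.0, 1.0])
objects = build_retrieval_objects("item-1", [1.0, 0.0], [0.0, 1.0])
```

Storing the fused feature alongside the two single-modality features lets a later query match an item through its image, its text, or their combination.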
According to another aspect of the present disclosure, a retrieval apparatus for a multimodal information base is provided, wherein the multimodal information base includes multiple pieces of target information, each including first modal information and second modal information. The apparatus includes: a retrieval feature extraction module, configured to: in response to receiving retrieval information including first modal information, use a first multimodal feature extraction module to extract a first modal feature of the retrieval information from the first modal information of the retrieval information; a target matching module, configured to: select a first group of target information from the multiple pieces of target information based on the similarity between the first modal feature of the retrieval information and each of the first modal feature and the second modal feature of each piece of target information, wherein the first modal feature of each piece of target information is extracted from the first modal information of that target information using the first multimodal feature extraction module, and the second modal feature of each piece of target information is extracted from the second modal information of that target information using a second multimodal feature extraction module; and a retrieval result generation module, configured to: generate a retrieval result based on the first group of target information.
According to another aspect of the present disclosure, a management apparatus for a multimodal information base is provided, including: a storage information extraction module, configured to: in response to receiving storage information including first modal information and second modal information, use a first multimodal feature extraction module to extract a first modal feature of the storage information from the first modal information of the storage information, and use a second multimodal feature extraction module to extract a second modal feature of the storage information from the second modal information of the storage information; a multimodal information generation module, configured to: calculate a multimodal feature of the storage information based on the first modal feature and the second modal feature of the storage information; a retrieval object generation module, configured to: generate, based on the first modal feature, the second modal feature, and the multimodal feature of the storage information, one or more retrieval objects corresponding to the storage information in the multimodal information base; and the retrieval apparatus for a multimodal information base as described in the present disclosure.
According to another aspect of the present disclosure, an electronic device is provided, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the retrieval method and/or the management method as described in the present disclosure.
According to another aspect of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions is provided, wherein the computer instructions are used to cause a computer to execute the retrieval method and/or the management method as described in the present disclosure.
According to another aspect of the present disclosure, a computer program product is provided, including a computer program, wherein the computer program, when executed by a processor, implements the retrieval method and/or the management method as described in the present disclosure.
According to one or more embodiments of the present disclosure, information of multiple modalities in the multimodal information base can be retrieved, which avoids the problem of mismatch between different modal information of the same piece of target information in the multimodal information base.
It should be understood that what is described in this section is not intended to identify key or important features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will be readily understood from the following description.
Brief Description of the Drawings
The drawings exemplarily illustrate embodiments and constitute a part of the specification; together with the text of the specification, they serve to explain exemplary implementations of the embodiments. The illustrated embodiments are for illustrative purposes only and do not limit the scope of the claims. Throughout the drawings, like reference numbers designate similar, but not necessarily identical, elements.
FIG. 1 shows a schematic diagram of an exemplary system in which various methods described herein may be implemented according to an embodiment of the present disclosure;
FIG. 2 shows a flowchart of a retrieval method for a multimodal information base according to an embodiment of the present disclosure;
FIG. 3 shows a flowchart of a retrieval method for a multimodal information base according to an embodiment of the present disclosure;
FIG. 4 shows a flowchart of a retrieval method for a multimodal information base according to an embodiment of the present disclosure;
FIG. 5 shows a flowchart of a management method for a multimodal information base according to an embodiment of the present disclosure;
FIG. 6 shows a flowchart of a management method for a multimodal information base according to an embodiment of the present disclosure;
FIG. 7 shows a flowchart of an example process of extracting a single-modal image feature of storage information from one or more pieces of subject information of the storage information in the method of FIG. 6, according to an embodiment of the present disclosure;
FIG. 8 shows a structural block diagram of a retrieval apparatus for a multimodal information base according to an embodiment of the present disclosure;
FIG. 9 shows a structural block diagram of a management apparatus for a multimodal information base according to an embodiment of the present disclosure;
FIG. 10 shows a structural block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings. Various details of the embodiments are included to facilitate understanding and should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope of the present disclosure. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted from the following description.
In the present disclosure, unless otherwise stated, the use of the terms "first", "second", etc. to describe various elements is not intended to limit the positional, temporal, or importance relationship of these elements; such terms are only used to distinguish one element from another. In some examples, the first element and the second element may refer to the same instance of the element, while in some cases, based on the context, they may refer to different instances.
The terminology used in the description of the various examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, if the number of an element is not specifically limited, there may be one or more of that element. In addition, the term "and/or" as used in the present disclosure covers any one of, and all possible combinations of, the listed items.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
FIG. 1 shows a schematic diagram of an exemplary system 100 in which the various methods and apparatuses described herein may be implemented according to an embodiment of the present disclosure. Referring to FIG. 1, the system 100 includes one or more client devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120. The client devices 101, 102, 103, 104, 105, and 106 may be configured to execute one or more applications.
In an embodiment of the present disclosure, the server 120 may run one or more services or software applications that enable execution of the retrieval method and/or the management method for a multimodal information base as described in the present disclosure.
In some embodiments, the server 120 may also provide other services or software applications, which may include non-virtual environments and virtual environments. In some embodiments, these services may be provided as web-based services or cloud services, for example to users of the client devices 101, 102, 103, 104, 105, and/or 106 under a software-as-a-service (SaaS) model.
In the configuration shown in FIG. 1, the server 120 may include one or more components that implement the functions performed by the server 120. These components may include software components, hardware components, or combinations thereof executable by one or more processors. Users operating the client devices 101, 102, 103, 104, 105, and/or 106 may in turn use one or more client applications to interact with the server 120 and make use of the services provided by these components. It should be understood that various system configurations are possible, which may differ from the system 100. Accordingly, FIG. 1 is one example of a system for implementing the various methods described herein and is not intended to be limiting.
A user may use the client devices 101, 102, 103, 104, 105, and/or 106 to retrieve target information in the multimodal information base (e.g., by uploading retrieval information), or to add target information to the multimodal information base (e.g., by uploading storage information). A client device may provide an interface that enables a user of the client device to interact with the client device. The client device may also output information to the user via the interface. Although FIG. 1 depicts only six client devices, those skilled in the art will understand that the present disclosure can support any number of client devices.
客户端设备101、102、103、104、105和/或106可以包括各种类型的计算机设备,例如便携式手持设备、通用计算机(诸如个人计算机和膝上型计算机)、工作站计算机、可穿戴设备、游戏***、瘦客户端、各种消息收发设备、传感器或其他感测设备等。这些计算机设备可以运行各种类型和版本的软件应用程序和操作***,例如MICROSOFT Windows、APPLE iOS、类UNIX操作***、Linux或类Linux操作***(例如GOOGLE Chrome OS);或包括各种移动操作***,例如MICROSOFT Windows Mobile OS、iOS、Windows Phone、Android。便携式手持设备可以包括蜂窝电话、智能电话、平板电脑、个人数字助理(PDA)等。可穿戴设备可以包括头戴式显示器和其他设备。游戏***可以包括各种手持式游戏设备、支持互联网的游戏设备等。客户端设备能够执行各种不同的应用程序,例如各种与Internet相关的应用程序、通信应用程序(例如电子邮件应用程序)、短消息服务(SMS)应用程序,并且可以使用各种通信协议。 Client devices 101, 102, 103, 104, 105, and/or 106 may include various types of computing devices, such as portable handheld devices, general-purpose computers (such as personal computers and laptops), workstation computers, wearable devices, Gaming systems, thin clients, various messaging devices, sensors or other sensing devices, etc. These computer devices can run various types and versions of software applications and operating systems, such as MICROSOFT Windows, APPLE iOS, UNIX-like operating systems, Linux or Linux-like operating systems (such as GOOGLE Chrome OS); or include various mobile operating systems , such as MICROSOFT Windows Mobile OS, iOS, Windows Phone, Android. Portable handheld devices may include cellular phones, smart phones, tablet computers, personal digital assistants (PDAs), and the like. Wearable devices can include head-mounted displays and other devices. Gaming systems may include various handheld gaming devices, Internet-enabled gaming devices, and the like. A client device is capable of executing a variety of different applications, such as various Internet-related applications, communication applications (eg, email applications), Short Message Service (SMS) applications, and may use a variety of communication protocols.
Network 110 may be any type of network known to those skilled in the art that can support data communication using any of a variety of available protocols, including but not limited to TCP/IP, SNA, and IPX. By way of example only, the one or more networks 110 may be a local area network (LAN), an Ethernet-based network, a token ring, a wide area network (WAN), the Internet, a virtual network, a virtual private network (VPN), an intranet, an extranet, a public switched telephone network (PSTN), an infrared network, a wireless network (e.g., Bluetooth, Wi-Fi), and/or any combination of these and/or other networks.
Server 120 may include one or more general-purpose computers, dedicated server computers (e.g., PC (personal computer) servers, UNIX servers, midrange servers), blade servers, mainframe computers, server clusters, or any other suitable arrangement and/or combination. Server 120 may include one or more virtual machines running virtual operating systems, or other computing architectures involving virtualization (e.g., one or more flexible pools of logical storage devices that may be virtualized to maintain the server's virtual storage devices). In various embodiments, server 120 may run one or more services or software applications that provide the functionality described below.
The computing units in server 120 may run one or more operating systems, including any of the operating systems described above as well as any commercially available server operating system. Server 120 may also run any of a variety of additional server applications and/or middle-tier applications, including HTTP servers, FTP servers, CGI servers, JAVA servers, database servers, and the like.
In some implementations, server 120 may include one or more applications to analyze and consolidate data feeds and/or event updates received from users of client devices 101, 102, 103, 104, 105, and 106. Server 120 may also include one or more applications to display data feeds and/or real-time events via one or more display devices of client devices 101, 102, 103, 104, 105, and 106. In some implementations, server 120 may include one or more applications, for example applications for services such as object detection and recognition or signal conversion based on data such as images, video, voice, text, and digital signals, to process task requests such as voice interaction, text classification, image recognition, or key point detection received from client devices 101, 102, 103, 104, 105, and 106. The server may train a neural network model with training samples according to the specific deep learning task, test each sub-network in the super-network module of the neural network model, and determine, from the test results of the sub-networks, the structure and parameters of the neural network model used to perform the deep learning task. Various kinds of data, such as image data, audio data, video data, or text data, may serve as training sample data for the deep learning task. After training of the neural network model is complete, server 120 may also automatically search for the optimal model structure through model search technology to perform the corresponding task.
In some implementations, server 120 may be a server of a distributed system or a server combined with a blockchain. Server 120 may also be a cloud server, or an intelligent cloud computing server or intelligent cloud host with artificial intelligence technology. A cloud server is a host product in the cloud computing service system that addresses the difficult management and weak business scalability of traditional physical hosts and virtual private server (VPS) services.
System 100 may also include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of databases 130 may be used to store information such as audio files and video files. Data repository 130 may reside in various locations. For example, the data repository used by server 120 may be local to server 120, or may be remote from server 120 and communicate with server 120 via a network-based or dedicated connection. Data repository 130 may be of different types. In some embodiments, the data repository used by server 120 may be a database, such as a relational database. One or more of these databases may store, update, and retrieve data to and from the database in response to commands.
In some embodiments, one or more of databases 130 may also be used by applications to store application data. The databases used by applications may be of different types, such as key-value stores, object stores, or conventional stores backed by a file system.
The system 100 of FIG. 1 may be configured and operated in various ways to enable application of the various methods and apparatuses described in accordance with the present disclosure.
As described above, a conventional information base is searched based on either the text keywords or the image content of the target information. It is therefore desirable to provide a method that retrieves based on multiple kinds of modality information of the target information (for example, image information and text information), so as to avoid cases in which the different modality information of the same target information does not match.
An embodiment of the present disclosure provides a retrieval method for a multimodal information base, wherein the multimodal information base includes multiple pieces of target information, and each piece of target information includes first modality information and second modality information. The method includes: in response to receiving retrieval information that includes first modality information, extracting a first modality feature of the retrieval information from its first modality information using a first multimodal feature extraction module; selecting a first group of target information from the multiple pieces of target information based on the similarity between the first modality feature of the retrieval information and each of the first modality feature and the second modality feature of each piece of target information, wherein the first modality feature of each piece of target information is extracted from the first modality information of that target information using the first multimodal feature extraction module, and the second modality feature of each piece of target information is extracted from the second modality information of that target information using a second multimodal feature extraction module; and generating a retrieval result based on the first group of target information.
FIG. 2 shows a flowchart of a retrieval method 200 for a multimodal information base according to an embodiment of the present disclosure. According to some embodiments, the multimodal information base includes multiple pieces of target information, and each piece of target information includes first modality information and second modality information.
At step S201, in response to receiving retrieval information that includes first modality information, a first multimodal feature extraction module is used to extract a first modality feature of the retrieval information from the first modality information of the retrieval information.
According to some embodiments, the retrieval information is a retrieval request sent by a client to the server. For example, when a user wishes to search online for a skirt seen offline, the user may enter "skirt" on the client or take a picture of the skirt, and the client sends the corresponding retrieval request to the server.
According to some embodiments, the first modality information of the retrieval information may be text information or image information. When it is text information, a multimodal text extraction module (for example, a BERT Base network) is used to extract a multimodal text feature of the retrieval information; when it is image information, a multimodal image extraction module (for example, a ViT Base network) is used to extract a multimodal image feature of the retrieval information.
At step S203, a first group of target information is selected from the multiple pieces of target information based on the similarity between the first modality feature of the retrieval information and each of the first modality feature and the second modality feature of each piece of target information, wherein the first modality feature of each piece of target information is extracted from the first modality information of that target information using the first multimodal feature extraction module, and the second modality feature of each piece of target information is extracted from the second modality information of that target information using the second multimodal feature extraction module.
According to some embodiments, selecting the first group of target information from the multiple pieces of target information includes: selecting a second group of target information from the multiple pieces of target information based on the similarity between the first modality feature of the retrieval information and the first modality feature of each piece of target information; and selecting the first group of target information from the second group based on the similarity between the first modality feature of the retrieval information and the second modality feature of each piece of target information in the second group.
According to other embodiments, selecting the first group of target information from the multiple pieces of target information includes: calculating a retrieval score for each piece of target information based on the similarity between the first modality feature of the retrieval information and the first modality feature of that target information and the similarity between the second modality feature of the retrieval information and the second modality feature of that target information; and selecting the first group of target information from the multiple pieces of target information based on the retrieval scores.
According to some embodiments, the multiple pieces of target information are sorted by the similarity between the first modality feature of the retrieval information and the first modality feature of each piece of target information, and the first predetermined number of pieces with the highest similarity are selected as the second group of target information. According to other embodiments, the pieces of target information whose first modality feature has a similarity to the first modality feature of the retrieval information greater than a first similarity threshold are selected from the multiple pieces of target information as the second group of target information.
According to some embodiments, the second group of target information is sorted by the similarity between the second modality feature of the retrieval information and the second modality feature of each piece of target information, and the second predetermined number of pieces with the highest similarity in the second group are selected as the first group of target information. According to other embodiments, the pieces of target information whose second modality feature has a similarity to the second modality feature of the retrieval information greater than a second similarity threshold are selected from the second group as the first group of target information. According to still other embodiments, all target information in the second group may be selected as the first group (that is, the first predetermined number equals the second predetermined number), and these pieces of target information are merely sorted by the similarity between the second modality feature of the retrieval information and their second modality features.
According to some embodiments, the similarity between the first modality feature of the retrieval information and the first modality feature of each piece of target information, and/or the similarity between the second modality feature of the retrieval information and the second modality feature of each piece of target information, is a cosine similarity.
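The cascaded selection of step S203 can be illustrated with a minimal sketch. It is not the disclosed implementation: the function names, the dictionary keys `first_feature` and `second_feature`, and the parameters `first_k` and `second_k` are hypothetical, and the sketch assumes that both extraction modules map into a shared embedding space, so that the first modality feature of the retrieval information can be compared directly with either feature of a piece of target information.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def two_stage_select(query_feature, targets, first_k, second_k):
    """Select the first group of target information in two stages.

    targets: list of dicts with hypothetical keys 'id', 'first_feature',
    and 'second_feature' (features precomputed when the item was stored).
    """
    # Stage 1: rank all targets by similarity to their first modality feature
    # and keep the top first_k as the "second group".
    by_first = sorted(
        targets,
        key=lambda t: cosine_similarity(query_feature, t["first_feature"]),
        reverse=True,
    )
    second_group = by_first[:first_k]

    # Stage 2: re-rank the second group by similarity to the second modality
    # feature and keep the top second_k as the "first group".
    by_second = sorted(
        second_group,
        key=lambda t: cosine_similarity(query_feature, t["second_feature"]),
        reverse=True,
    )
    return by_second[:second_k]
```

In this sketch a piece of target information whose first modality feature matches the query but whose second modality feature does not is ranked low in the second stage, which is how the cascade filters out target information with mismatched modalities.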
According to some embodiments, the first modality information is one of image information and text information, and the second modality information is the other of image information and text information.
At step S205, a retrieval result is generated based on the first group of target information.
According to some embodiments, the order of the target information in the retrieval result is determined based on the similarity between the first modality feature of the retrieval information and the first modality feature of each piece of target information and/or the similarity between the second modality feature of the retrieval information and the second modality feature of each piece of target information. According to some embodiments, a retrieval score is calculated for each piece of target information based on one or both of these similarities, and a retrieval result corresponding to the first group of target information is generated based on the retrieval scores.
In the retrieval method for a multimodal information base according to embodiments of the present disclosure, even if the user inputs only single-modality retrieval information, retrieval can be performed based on multiple kinds of modality information of the target information. This avoids the problem of mismatch between different modality information of the same target information in the multimodal information base (for example, the image information of a piece of target information not matching its text information).
According to some embodiments, the first modality information is image information, and the retrieval method for a multimodal information base according to the present disclosure further includes, before selecting the first group of target information from the multiple pieces of target information: extracting one or more pieces of subject information from the first modality information of the retrieval information using a subject detection module; for each piece of subject information, extracting a single-modality image feature of that subject information using an image feature extraction module; and selecting a third group of target information from the multiple pieces of target information based on the similarity between the single-modality image features of the one or more pieces of subject information and the single-modality image feature of each piece of target information, wherein the single-modality image feature of each piece of target information is extracted from the first modality information of that target information using the image feature extraction module. Selecting the first group of target information from the multiple pieces of target information then includes: for each piece of target information in the third group, calculating a similarity score for that target information based on the similarity between the first modality feature of the retrieval information and the first modality feature of that target information and the similarity between the first modality feature of the retrieval information and the second modality feature of that target information; and selecting the first group of target information from the third group based on the similarity score of each piece of target information in the third group.
FIG. 3 shows a flowchart of a retrieval method 300 for a multimodal information base according to an embodiment of the present disclosure.
At step S301, in response to receiving retrieval information that includes first modality information, the first multimodal feature extraction module is used to extract the first modality feature of the retrieval information from the first modality information of the retrieval information, wherein the first modality information is image information. According to some embodiments, step S301 may be performed similarly to step S201 in FIG. 2.
At step S303, a subject detection module is used to extract one or more pieces of subject information from the first modality information of the retrieval information.
According to some embodiments, the subject detection module uses an object detector (for example, YOLO-v3) to perform object detection on the first modality information (that is, the image information) of the retrieval information. From the detected boxes, those with relatively high confidence and suitable size and position are kept (for example, boxes with low confidence, small size, or a position close to the image border are filtered out), and the portions of the first modality information corresponding to the kept boxes are extracted as the one or more pieces of subject information. According to some embodiments, boxes that are small or close to the image border may be filtered out first, and the two remaining boxes with the highest confidence are then selected as the kept boxes.
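The box filtering described above can be sketched as follows. The disclosure does not specify the thresholds, so `min_area_ratio` and `border_margin_ratio` are hypothetical parameters with illustrative defaults, and the box format `(x1, y1, x2, y2, confidence)` is an assumption about the detector output rather than the actual YOLO-v3 interface.

```python
def filter_boxes(boxes, image_width, image_height,
                 min_area_ratio=0.05, border_margin_ratio=0.02, max_boxes=2):
    """Keep up to max_boxes detection boxes that are large enough and not near the border.

    boxes: list of (x1, y1, x2, y2, confidence) tuples in pixel coordinates.
    The ratio thresholds are illustrative assumptions, not values from the disclosure.
    """
    image_area = image_width * image_height
    margin_x = image_width * border_margin_ratio
    margin_y = image_height * border_margin_ratio

    kept = []
    for x1, y1, x2, y2, confidence in boxes:
        # Drop boxes that are too small relative to the image.
        area = max(0, x2 - x1) * max(0, y2 - y1)
        if area < min_area_ratio * image_area:
            continue
        # Drop boxes that touch the margin band along any image border.
        near_border = (
            x1 < margin_x
            or y1 < margin_y
            or x2 > image_width - margin_x
            or y2 > image_height - margin_y
        )
        if near_border:
            continue
        kept.append((x1, y1, x2, y2, confidence))

    # Of the remaining boxes, keep the ones with the highest confidence.
    kept.sort(key=lambda box: box[4], reverse=True)
    return kept[:max_boxes]
```

Filtering on size and position before ranking by confidence follows the order given in the second variant above, so a small but confidently detected background object does not displace the main subject.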
At step S305, for each piece of subject information, an image feature extraction module is used to extract the single-modality image feature of that subject information.
According to some embodiments, the image feature extraction module has the same structure as the first multimodal feature extraction module. According to some embodiments, the first multimodal feature extraction module is trained before the image feature extraction module, and the trained parameters of the first multimodal feature extraction module are used as the initialization parameters of the image feature extraction module when training it (for example, by fine-tuning with metric learning on ID-labeled image data). Compared with training the image feature extraction module from scratch, initializing it with the trained parameters of the first multimodal feature extraction module shortens its training time.
At step S307, a third group of target information is selected from the multiple pieces of target information based on the similarity between the single-modality image features of the one or more pieces of subject information and the single-modality image feature of each piece of target information.
According to some embodiments, the one or more pieces of subject information include multiple pieces of subject information, and selecting the third group of target information from the multiple pieces of target information includes: for each piece of subject information, selecting, from the multiple pieces of target information, the pieces corresponding to that subject information based on the similarity between the single-modality image feature of that subject information and the single-modality image feature of each piece of target information; and taking the pieces of target information selected for every piece of subject information as the third group of target information. According to some embodiments, the pieces of target information corresponding to each piece of subject information are aggregated and deduplicated to obtain the third group of target information.
According to other embodiments, the one or more pieces of subject information include a single piece of subject information, and selecting the third group of target information from the multiple pieces of target information includes: selecting, from the multiple pieces of target information, the pieces corresponding to that subject information as the third group, based on the similarity between the single-modality image feature of that subject information and the single-modality image feature of each piece of target information.
According to some embodiments, for each piece of subject information, the multiple pieces of target information are sorted by the similarity between the single-modality image feature of that subject information and the single-modality image feature of each piece of target information, and the third predetermined number of pieces with the highest similarity are selected. According to other embodiments, for each piece of subject information, the pieces of target information whose single-modality image feature has a similarity to the single-modality image feature of that subject information greater than a first similarity threshold are selected from the multiple pieces of target information.
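The per-subject selection of step S307, followed by aggregation and deduplication, might look like the sketch below. The function name `select_third_group`, the key `image_feature`, and the parameter `top_k` (standing in for the third predetermined number) are hypothetical, and the exhaustive sort is used only for clarity; a production system would more likely query a vector index.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_third_group(subject_features, targets, top_k):
    """Union of the top_k targets for each detected subject, without duplicates.

    targets: list of dicts with hypothetical keys 'id' and 'image_feature'
    (the single-modality image feature stored for each target).
    """
    seen_ids = set()
    third_group = []
    for subject_feature in subject_features:
        # Rank every target against this subject's single-modality image feature.
        ranked = sorted(
            targets,
            key=lambda t: cosine_similarity(subject_feature, t["image_feature"]),
            reverse=True,
        )
        # Aggregate the top_k hits, skipping targets already selected
        # for an earlier subject (deduplication).
        for target in ranked[:top_k]:
            if target["id"] not in seen_ids:
                seen_ids.add(target["id"])
                third_group.append(target)
    return third_group
```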
At step S309, for each piece of target information in the third group, a similarity score is calculated for that target information based on the similarity between the first modality feature of the retrieval information and the first modality feature of that target information and the similarity between the first modality feature of the retrieval information and the second modality feature of that target information.
According to some embodiments, the larger of these two similarities, namely the similarity between the first modality feature of the retrieval information and the first modality feature of the target information and the similarity between the first modality feature of the retrieval information and the second modality feature of the target information, is taken as the similarity score of that target information.
At step S311, the first group of target information is selected from the third group based on the similarity score of each piece of target information in the third group.
According to some embodiments, the second predetermined number of pieces of target information with the highest similarity scores are selected from the third group as the first group of target information. According to other embodiments, the pieces of target information in the third group whose similarity scores are higher than a similarity threshold are selected as the first group. According to still other embodiments, all target information in the third group is selected as the first group, and the target information in the first group is sorted by similarity score.
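Steps S309 and S311 can be sketched together: the score of a piece of target information is the maximum of its two similarities, and the three selection variants above map onto optional arguments. The names `similarity_score`, `select_first_group`, `top_k`, and `threshold` are hypothetical, as are the dictionary keys.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_score(query_feature, target):
    """Step S309: maximum of the query's similarity to the target's two features."""
    return max(
        cosine_similarity(query_feature, target["first_feature"]),
        cosine_similarity(query_feature, target["second_feature"]),
    )


def select_first_group(query_feature, third_group, top_k=None, threshold=None):
    """Step S311: return (score, id) pairs sorted by descending score.

    top_k keeps the highest-scoring entries, threshold keeps entries scoring
    strictly above it, and with neither set the whole third group is kept
    and merely sorted.
    """
    scored = [(similarity_score(query_feature, t), t["id"]) for t in third_group]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    if threshold is not None:
        scored = [pair for pair in scored if pair[0] > threshold]
    if top_k is not None:
        scored = scored[:top_k]
    return scored
```

Because the function returns the scores along with the identifiers, the same list can be used to order the retrieval result generated in the following step.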
At step S313, a retrieval result is generated based on the first group of target information.
According to some embodiments, the retrieval result is generated based on the similarity scores of the first group of target information.
In the retrieval method for a multimodal information base provided by the present disclosure, according to some embodiments, steps S303-S307 are performed between step S301 and step S309. According to other embodiments, steps S301-S309 may be performed in a different order; for example, steps S303-S307 may be performed first, followed by step S301 and then step S309.
In the retrieval method for a multimodal information base according to embodiments of the present disclosure, single-modality image features are extracted from the subject information in the image information, and the third group of target information is preliminarily screened from the multiple pieces of target information based on the similarity between the single-modality image features of the subject information and the single-modality image feature of each piece of target information. This improves retrieval accuracy when the user inputs only image information.
According to some embodiments, the retrieval method for a multimodal information base according to the present disclosure further includes, before generating the retrieval result based on the first group of target information: in response to receiving retrieval information that includes both first modality information and second modality information, extracting the first modality feature of the retrieval information using the first multimodal feature extraction module and extracting the second modality feature of the retrieval information using the second multimodal feature extraction module; generating a multimodal feature of the retrieval information based on its first modality feature and second modality feature; and selecting the first group of target information from the multiple pieces of target information based on the similarity between the multimodal feature of the retrieval information and the multimodal feature of each piece of target information, wherein the multimodal feature of each piece of target information is generated based on the first modality feature and the second modality feature of that target information.
图4示出了根据本公开的实施例的用于多模态信息库的检索方法400的流程图。FIG. 4 shows a flowchart of a retrieval method 400 for a multimodal information base according to an embodiment of the present disclosure.
At step S401, it is determined whether the received retrieval information includes both first modality information and second modality information. If the result is "No", the method proceeds to step S403; if the result is "Yes", the method proceeds to step S407.
At step S403, in response to receiving retrieval information that includes first modality information, a first modal feature of the retrieval information is extracted from the first modality information of the retrieval information using the first multimodal feature extraction module. According to some embodiments, step S403 may be performed similarly to step S201.
At step S405, a first group of target information is selected from the multiple pieces of target information based on the similarity between the first modal feature of the retrieval information and each of the first modal feature and the second modal feature of each piece of target information. According to some embodiments, step S405 may be performed similarly to step S203.
At step S407, a first modal feature of the retrieval information is extracted using the first multimodal feature extraction module, and a second modal feature of the retrieval information is extracted using the second multimodal feature extraction module.
At step S409, a multimodal feature of the retrieval information is generated based on the first modal feature and the second modal feature of the retrieval information.
According to some embodiments, generating the multimodal feature of the retrieval information includes: for each of the first modal feature and the second modal feature of the retrieval information, multiplying that modal feature by the weight corresponding to it to obtain the product corresponding to that modal feature; and normalizing the sum of the products corresponding to the first modal feature and the second modal feature of the retrieval information to obtain the multimodal feature of the retrieval information.
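The weighted sum and normalization described above can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the function name `fuse_modal_features`, the default weights of 0.5, and the use of the L2 norm are assumptions, since the text specifies only "weights" and "normalization".

```python
import math


def fuse_modal_features(first_feature, second_feature, w_first=0.5, w_second=0.5):
    """Weighted sum of two modal features followed by L2 normalization.

    The weights and the choice of the L2 norm are illustrative assumptions.
    """
    if len(first_feature) != len(second_feature):
        raise ValueError("both modal features must have the same dimension")
    # Multiply each modal feature by its weight and sum the products.
    summed = [w_first * a + w_second * b
              for a, b in zip(first_feature, second_feature)]
    # Normalize the sum to unit length (left unchanged if it is all zeros).
    norm = math.sqrt(sum(x * x for x in summed))
    return summed if norm == 0.0 else [x / norm for x in summed]


# Two orthogonal unit features fuse into a unit vector midway between them.
fused = fuse_modal_features([1.0, 0.0], [0.0, 1.0])
```

Normalizing the fused vector keeps query and target features on the same scale, so a dot product between two fused features behaves like a cosine similarity.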
At step S411, a first group of target information is selected from the multiple pieces of target information based on the similarity between the multimodal feature of the retrieval information and the multimodal feature of each piece of target information.
At step S413, a retrieval result is generated based on the first group of target information.
According to some embodiments, when the retrieval information includes the first modality information, step S413 may be performed similarly to step S205. According to other embodiments, when the retrieval information includes the first modality information and the second modality information, the retrieval result may be generated based on the similarity between the multimodal feature of the retrieval information and the multimodal feature of each piece of target information in the first group of target information.
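The branching of method 400 can be illustrated with the sketch below. It assumes that features have already been extracted, so the extraction modules are not modeled, and that the two similarities in the single-modality branch are combined by simple addition. The names `retrieve`, `fuse`, `cosine`, the dictionary keys, and the `top_k` cutoff are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0.0 or nb == 0.0 else dot / (na * nb)


def fuse(first, second, w_first=0.5, w_second=0.5):
    """Weighted sum plus L2 normalization (weights are assumptions)."""
    summed = [w_first * a + w_second * b for a, b in zip(first, second)]
    norm = math.sqrt(sum(x * x for x in summed))
    return summed if norm == 0.0 else [x / norm for x in summed]


def retrieve(targets, query_first, query_second=None, top_k=1):
    """targets: list of dicts with "id", "first", "second" and "multi" features."""
    if query_second is None:
        # S401 "No" -> S403/S405: compare the single query feature with both
        # modal features of each target (summing them is an assumption).
        def score(target):
            return (cosine(query_first, target["first"])
                    + cosine(query_first, target["second"]))
    else:
        # S401 "Yes" -> S407-S411: compare fused multimodal features.
        query_multi = fuse(query_first, query_second)

        def score(target):
            return cosine(query_multi, target["multi"])
    # S413: rank the selected targets to form the retrieval result.
    ranked = sorted(targets, key=score, reverse=True)
    return [target["id"] for target in ranked[:top_k]]


targets = [
    {"id": "a", "first": [1.0, 0.0], "second": [0.8, 0.6],
     "multi": fuse([1.0, 0.0], [0.8, 0.6])},
    {"id": "b", "first": [0.0, 1.0], "second": [0.6, 0.8],
     "multi": fuse([0.0, 1.0], [0.6, 0.8])},
]
single_modality_result = retrieve(targets, [1.0, 0.0])
two_modality_result = retrieve(targets, [0.0, 1.0], [0.0, 1.0])
```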
According to some embodiments, the first multimodal feature extraction module and the second multimodal feature extraction module are obtained by training based on a loss function, where the loss function is a function of the similarity between the features extracted by the first multimodal feature extraction module and the features extracted by the second multimodal feature extraction module.
Because the loss function is a function of the similarity between the features extracted by the two modules, the first multimodal feature extraction module and the second multimodal feature extraction module can be trained jointly, so that the distance between the modal features of samples whose modality information matches is shortened and the distance between the modal features of samples whose modality information does not match is lengthened.
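The text does not name a specific loss. One common choice with the stated property is a symmetric contrastive (InfoNCE-style) loss over a batch of matched pairs, which is swapped in here purely as an illustration. The function names, the `temperature` value, and the assumption that features are already L2-normalized are not from the source.

```python
import math


def log_softmax_at(logits, index):
    """Log-probability of `index` under a softmax over `logits`."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return logits[index] - log_sum


def contrastive_loss(first_feats, second_feats, temperature=0.1):
    """Symmetric contrastive loss; pair k of each list is a matched sample.

    Assumes unit-length features, so the dot product is a cosine similarity.
    """
    n = len(first_feats)
    sims = [[sum(a * b for a, b in zip(f, s)) / temperature
             for s in second_feats]
            for f in first_feats]
    total = 0.0
    for k in range(n):
        row = sims[k]                            # first-modality sample k vs all second
        column = [sims[j][k] for j in range(n)]  # second-modality sample k vs all first
        # Reward high similarity on the matched pair relative to mismatched ones.
        total -= log_softmax_at(row, k) + log_softmax_at(column, k)
    return total / (2 * n)


matched = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when matched pairs are the most similar entries in their row and column, and large when a mismatched pair is more similar, which is the pull-together and push-apart behavior described above.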
Embodiments of the present disclosure further provide a management method for a multimodal information base. The method includes: in response to receiving storage information that includes first modality information and second modality information, extracting a first modal feature of the storage information from the first modality information of the storage information using the first multimodal feature extraction module, and extracting a second modal feature of the storage information from the second modality information of the storage information using the second multimodal feature extraction module; calculating a multimodal feature of the storage information based on the first modal feature and the second modal feature of the storage information; generating, based on the first modal feature, the second modal feature, and the multimodal feature of the storage information, one or more retrieval objects in the multimodal information base that correspond to the storage information; and, in response to receiving retrieval information, performing the retrieval method described in the present disclosure.
FIG. 5 shows a flowchart of a management method 500 for a multimodal information base according to an embodiment of the present disclosure.
At step S501, in response to receiving storage information that includes first modality information and second modality information, a first modal feature of the storage information is extracted from the first modality information of the storage information using the first multimodal feature extraction module, and a second modal feature of the storage information is extracted from the second modality information of the storage information using the second multimodal feature extraction module.
At step S503, a multimodal feature of the storage information is calculated based on the first modal feature and the second modal feature of the storage information.
According to some embodiments, calculating the multimodal feature of the storage information includes: for each of the first modal feature and the second modal feature of the storage information, multiplying that modal feature by the weight corresponding to it to obtain the product corresponding to that modal feature; and normalizing the sum of the products corresponding to the first modal feature and the second modal feature of the storage information to obtain the multimodal feature of the retrieval information.
At step S505, one or more retrieval objects in the multimodal information base that correspond to the storage information are generated based on the first modal feature, the second modal feature, and the multimodal feature of the storage information.
In some embodiments, each kind of feature of the storage information is added to the corresponding index file of the multimodal information base. For example, the first modal feature is added to the retrieval file corresponding to first modal features, and the second modal feature is added to the retrieval file corresponding to second modal features, so that each kind of feature can be retrieved independently. In some embodiments, a retrieval object corresponding to the storage information is created in a retrieval file of the multimodal information base, where the retrieval object includes the corresponding feature, an ID, a related network link, and the like of the storage information.
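A minimal in-memory sketch of this layout is shown below, with one "index file" per feature kind and one retrieval object per stored item in each file. The class and field names, the feature-kind keys, the brute-force cosine search, and the `example.com` links are all hypothetical; a production system would more likely use a dedicated vector index.

```python
import math
from dataclasses import dataclass


@dataclass
class RetrievalObject:
    """One entry in an index file: the feature plus ID and related link."""
    item_id: str
    feature: list
    link: str


class MultimodalInfoBase:
    def __init__(self, feature_kinds=("first", "second", "multi")):
        # One independent "index file" (here, a list) per kind of feature.
        self.index_files = {kind: [] for kind in feature_kinds}

    def add(self, item_id, link, features):
        """features: dict mapping a feature kind to that item's feature vector."""
        for kind, feature in features.items():
            self.index_files[kind].append(
                RetrievalObject(item_id, list(feature), link))

    def search(self, kind, query, top_k=1):
        """Brute-force cosine search within a single index file."""
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(y * y for y in b))
            return 0.0 if na == 0.0 or nb == 0.0 else dot / (na * nb)

        ranked = sorted(self.index_files[kind],
                        key=lambda obj: cosine(query, obj.feature),
                        reverse=True)
        return [obj.item_id for obj in ranked[:top_k]]


base = MultimodalInfoBase()
base.add("item-1", "https://example.com/1",
         {"first": [1.0, 0.0], "second": [0.0, 1.0], "multi": [0.7, 0.7]})
base.add("item-2", "https://example.com/2",
         {"first": [0.0, 1.0], "second": [1.0, 0.0], "multi": [0.7, 0.7]})
```

Keeping the index files separate means the same query vector can be searched against first modal features or second modal features independently, as the paragraph above describes.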
At step S507, in response to receiving retrieval information, target information corresponding to the retrieval information is retrieved from the multimodal information base. According to some embodiments, the retrieval method described in the present disclosure is performed to retrieve the target information corresponding to the retrieval information from the modal retrieval library.
In some embodiments, the first modality information is one of image information and text information, and the second modality information is the other of image information and text information. The management method for a multimodal information base described in the present disclosure further includes, before generating the one or more retrieval objects in the multimodal information base that correspond to the storage information: extracting one or more pieces of subject information of the storage information from the image information of the storage information using a subject detection module; and extracting a single-modal image feature of the storage information from the one or more pieces of subject information of the storage information using the first multimodal feature extraction module, the second multimodal feature extraction module, and an image feature extraction module. In this case, generating the one or more retrieval objects in the multimodal information base that correspond to the storage information includes: generating the one or more retrieval objects based on the first modal feature, the second modal feature, the multimodal feature, and the single-modal image feature of the storage information.
FIG. 6 shows a flowchart of a management method 600 for a multimodal information base according to an embodiment of the present disclosure.
At step S601, in response to receiving storage information that includes first modality information and second modality information, a first modal feature of the storage information is extracted from the first modality information of the storage information using the first multimodal feature extraction module, and a second modal feature of the storage information is extracted from the second modality information of the storage information using the second multimodal feature extraction module. According to some embodiments, step S601 may be performed similarly to step S501.
At step S603, a multimodal feature of the storage information is calculated based on the first modal feature and the second modal feature of the storage information. According to some embodiments, step S603 may be performed similarly to step S503.
At step S605, one or more pieces of subject information of the storage information are extracted from the image information of the storage information using the subject detection module.
According to some embodiments, the one or more pieces of subject information of the storage information may be extracted from the image information of the storage information in a manner similar to that described with reference to step S303, in which the subject detection module is used to extract one or more pieces of subject information from the first modality information of the retrieval information.
At step S607, a single-modal image feature of the storage information is extracted from the one or more pieces of subject information of the storage information using the first multimodal feature extraction module, the second multimodal feature extraction module, and the image feature extraction module.
At step S609, one or more retrieval objects in the multimodal information base that correspond to the storage information are generated based on the first modal feature, the second modal feature, the multimodal feature, and the single-modal image feature of the storage information.
According to some embodiments, similarly to what is described above with reference to step S507, each kind of feature of the storage information is added to the corresponding index file of the multimodal information base, and a retrieval object corresponding to the storage information is created in each retrieval file of the multimodal information base.
At step S611, in response to receiving retrieval information, target information corresponding to the retrieval information is retrieved from the multimodal information base. According to some embodiments, the retrieval method described in the present disclosure is performed to retrieve the target information corresponding to the retrieval information from the modal retrieval library.
In the retrieval method for a multimodal information base provided in the present disclosure, according to some embodiments, steps S603 to S607 are performed between step S601 and step S609. According to other embodiments, steps S601 to S609 may be performed in another order; for example, steps S603 to S607 are performed first, then step S601, and then step S609.
According to some embodiments, the first multimodal feature extraction module is one of a multimodal image extraction module and a multimodal text extraction module, and the second multimodal feature extraction module is the other of the multimodal image extraction module and the multimodal text extraction module.
According to some embodiments, extracting the single-modal image feature of the storage information from the one or more pieces of subject information of the storage information includes: for each of the image information of the storage information and the one or more pieces of subject information, extracting a multimodal image feature of that information using the multimodal image extraction module; extracting a multimodal text feature of the storage information from the text information of the storage information using the multimodal text extraction module; for each of the image information of the storage information and the one or more pieces of subject information, calculating the similarity between the multimodal image feature of that information and the multimodal text feature of the storage information as the similarity score of that information; selecting, from the image information of the storage information and the one or more pieces of subject information, the information with the highest similarity score; and extracting the single-modal image feature of the storage information from the information with the highest similarity score using the image feature extraction module.
FIG. 7 shows a flowchart of an example process of extracting the single-modal image feature of the storage information from the one or more pieces of subject information of the storage information (step S607) in the method of FIG. 6, according to an embodiment of the present disclosure.
At step S701, for each of the image information of the storage information and the one or more pieces of subject information, a multimodal image feature of that information is extracted using the multimodal image extraction module.
At step S703, a multimodal text feature of the storage information is extracted from the text information of the storage information using the multimodal text extraction module.
At step S705, for each of the image information of the storage information and the one or more pieces of subject information, the similarity between the multimodal image feature of that information and the multimodal text feature of the storage information is calculated as the similarity score of that information.
At step S707, the information with the highest similarity score is selected from the image information of the storage information and the one or more pieces of subject information.
At step S709, the single-modal image feature of the storage information is extracted from the information with the highest similarity score using the image feature extraction module. According to some embodiments, the information with the highest similarity score is stored in the multimodal information base as the subject information of the storage information.
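Steps S705 and S707 amount to an argmax over image-text similarities, which can be sketched as follows. The sketch assumes the multimodal image features (for the whole image and for each detected subject) and the multimodal text feature have already been extracted, and uses cosine similarity as the similarity measure. The names `select_best_region` and `cosine` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0.0 or nb == 0.0 else dot / (na * nb)


def select_best_region(region_features, text_feature):
    """Return the index of the region that best matches the text, plus all scores.

    region_features[0] is assumed to be the whole image and the rest are
    detected subjects, so the list is never empty.
    """
    # S705: similarity score of every candidate region against the text.
    scores = [cosine(feature, text_feature) for feature in region_features]
    # S707: keep the candidate with the highest similarity score.
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return best_index, scores


regions = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # whole image, subject 1, subject 2
best_index, scores = select_best_region(regions, [0.0, 1.0])
```

The selected region would then be passed to the image feature extraction module (step S709), which is not modeled here.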
In the management method for a multimodal information base described in the present disclosure, the single-modal image feature of the storage information is extracted from whichever of the image information of the storage information and the detected subject information is closest to the text information of the storage information. This ensures that the extracted single-modal image feature is accurate and consistent with the accompanying text.
FIG. 8 shows a structural block diagram of a retrieval apparatus 800 for a multimodal information base according to an embodiment of the present disclosure.
According to some embodiments, the retrieval apparatus 800 includes: a retrieval feature extraction module 801 configured to, in response to receiving retrieval information that includes the first modality information, extract a first modal feature of the retrieval information from the first modality information of the retrieval information using the first multimodal feature extraction module; a target matching module 802 configured to select a first group of target information from the multiple pieces of target information based on the similarity between the first modal feature of the retrieval information and each of the first modal feature and the second modal feature of each piece of target information, where the first modal feature of each piece of target information is extracted from the first modality information of that target information using the first multimodal feature extraction module, and the second modal feature of each piece of target information is extracted from the second modality information of that target information using the second multimodal feature extraction module; and a retrieval result generation module 803 configured to generate a retrieval result based on the first group of target information.
According to some embodiments, the multimodal information base includes multiple pieces of target information, each of which includes first modality information and second modality information.
According to some embodiments, the target matching module 802 includes: a second target information selection module configured to select a second group of target information from the multiple pieces of target information based on the similarity between the first modal feature of the retrieval information and the first modal feature of each piece of target information; and a first target information selection module configured to select the first group of target information from the second group of target information based on the similarity between the first modal feature of the retrieval information and the second modal feature of each piece of target information in the second group.
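The two-stage selection performed by these two modules can be sketched as a coarse filter followed by a finer one. The text does not say how many targets each stage keeps, so the `k_second` and `k_first` cutoffs, the function names, and the dictionary keys below are assumptions. A similarity threshold would be an equally plausible selection rule.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0.0 or nb == 0.0 else dot / (na * nb)


def two_stage_select(query_first, targets, k_second=2, k_first=1):
    """targets: list of dicts with "id", "first" and "second" features."""
    # Stage 1: second group, chosen by similarity to each target's first modal feature.
    second_group = sorted(
        targets,
        key=lambda t: cosine(query_first, t["first"]),
        reverse=True)[:k_second]
    # Stage 2: first group, chosen from the second group by similarity
    # to each target's second modal feature.
    first_group = sorted(
        second_group,
        key=lambda t: cosine(query_first, t["second"]),
        reverse=True)[:k_first]
    return [t["id"] for t in first_group]


targets = [
    {"id": "a", "first": [1.0, 0.0], "second": [0.0, 1.0]},
    {"id": "b", "first": [0.9, 0.1], "second": [1.0, 0.0]},
    {"id": "c", "first": [0.0, 1.0], "second": [1.0, 0.0]},
]
selected = two_stage_select([1.0, 0.0], targets)
```

In this example target "c" has a perfectly matching second modal feature but is never considered in stage two, because it was already filtered out by the first modal feature in stage one.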
According to some embodiments, the first modality information is image information, and the retrieval apparatus 800 further includes a subject feature extraction module, which includes: a subject detection module configured to extract one or more pieces of subject information from the first modality information of the retrieval information using a subject detection module; a single-modal image feature extraction module configured to, for each piece of subject information, extract a single-modal image feature of that subject information from the subject information using an image feature extraction module; and a third target information selection module configured to select a third group of target information from the multiple pieces of target information based on the similarity between the single-modal image features of the one or more pieces of subject information and the single-modal image feature of each piece of target information, where the single-modal image feature of each piece of target information is extracted from the first modality information of that target information using the image feature extraction module. In addition, the target matching module 802 includes: a similarity calculation module configured to, for each piece of target information in the third group of target information, calculate a similarity score of that target information based on the similarity between the first modal feature of the retrieval information and the first modal feature of that target information and on the similarity between the first modal feature of the retrieval information and the second modal feature of that target information; and a first target information selection module configured to select the first group of target information from the third group of target information based on the similarity score of each piece of target information in the third group.
According to some embodiments, the one or more pieces of subject information include multiple pieces of subject information, and the first target information selection module includes a subject information matching module configured to: for each piece of subject information, select, from the multiple pieces of target information, multiple pieces of target information corresponding to that subject information based on the similarity between the single-modal image feature of that subject information and the single-modal image feature of each piece of target information; and select the multiple pieces of target information corresponding to each piece of subject information as the third group of target information.
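As a rough illustration, the third group can be built as the union of the best matches for each detected subject. The per-subject cutoff `per_subject`, the removal of duplicates, and the names used below are assumptions. The text states only that the targets corresponding to each piece of subject information together form the third group.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0.0 or nb == 0.0 else dot / (na * nb)


def build_third_group(subject_features, targets, per_subject=2):
    """targets: list of dicts with "id" and a single-modal "image" feature."""
    selected_ids = []
    seen = set()
    for subject in subject_features:
        # Best-matching targets for this subject by single-modal image similarity.
        ranked = sorted(targets,
                        key=lambda t: cosine(subject, t["image"]),
                        reverse=True)
        for target in ranked[:per_subject]:
            # Keep each target once even if several subjects match it.
            if target["id"] not in seen:
                seen.add(target["id"])
                selected_ids.append(target["id"])
    return selected_ids


targets = [
    {"id": "a", "image": [1.0, 0.0]},
    {"id": "b", "image": [0.0, 1.0]},
    {"id": "c", "image": [0.6, 0.8]},
]
subjects = [[1.0, 0.0], [0.0, 1.0]]
third_group = build_third_group(subjects, targets)
```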
According to some embodiments, the retrieval apparatus 800 further includes a multimodal feature retrieval module, which includes: a multimodal sub-feature extraction module configured to, in response to receiving retrieval information that includes first modality information and second modality information, extract a first modal feature of the retrieval information using the first multimodal feature extraction module and extract a second modal feature of the retrieval information using the second multimodal feature extraction module; a multimodal feature generation module configured to generate a multimodal feature of the retrieval information based on the first modal feature and the second modal feature of the retrieval information; and a first target information selection module configured to select a first group of target information from the multiple pieces of target information based on the similarity between the multimodal feature of the retrieval information and the multimodal feature of each piece of target information, where the multimodal feature of each piece of target information is generated based on the first modal feature and the second modal feature of that target information.
According to some embodiments, the multimodal feature generation module includes: a product calculation module configured to, for each of the first modal feature and the second modal feature of the retrieval information, multiply that modal feature by the weight corresponding to it to obtain the product corresponding to that modal feature; and a normalization module configured to normalize the sum of the products corresponding to the first modal feature and the second modal feature of the retrieval information to obtain the multimodal feature of the retrieval information.
According to some embodiments, the first multimodal feature extraction module and the second multimodal feature extraction module are obtained by training based on a loss function, where the loss function is a function of the similarity between the features extracted by the first multimodal feature extraction module and the features extracted by the second multimodal feature extraction module.
According to some embodiments, the first modality information is one of image information and text information, and the second modality information is the other of image information and text information.
图9示出了根据本公开的实施例的用于多模态信息库的管理装置900的结构框图。Fig. 9 shows a structural block diagram of a management device 900 for a multimodal information base according to an embodiment of the present disclosure.
根据一些实施例,如图9所示,管理装置900包括:入库信息提取模块901,被配置为:响应于接收到包括第一模态信息和第二模态信息的入库信息,使用第一多模态特征提取模块,从所述入库信息的第一模态信息中提取所述入库信息的第一模态特征,并且,使用第二多模态特征提取模块,从所述入库信息的第二模态信息中提取所述入库信息的第二模态特征;多模态信息生成模块902,被配置为:基于所述入库信息的第一模态特征和第二模态特征,计算所述入库信息的多模态特征;检索对象生成模块903,被配置为:基于所述入库信息的第一模态特征、第二模态特征和多模态特征,生成所述 多模态信息库中对应于所述入库信息的一个或多个检索对象;以及如本公开所述的用于多模态信息库的检索装置800。According to some embodiments, as shown in FIG. 9 , the management device 900 includes: a warehouse-in information extraction module 901 configured to: in response to receiving the warehouse-in information including the first modal information and the second modal information, use the first A multi-modal feature extraction module extracts the first modal feature of the warehousing information from the first modal information of the warehousing information, and uses a second multi-modal feature extraction module to extract the Extract the second modal feature of the storage information from the second modal information of the storage information; the multi-modal information generating module 902 is configured to: based on the first modal feature and the second modal feature of the storage information modal features, and calculate the multimodal features of the storage information; the retrieval object generation module 903 is configured to: generate One or more search objects in the multi-modal information base corresponding to the storage information; and the retrieval device 800 for the multi-modal information base as described in the present disclosure.
根据一些实施例,第一模态信息为图像信息和文本信息中的任一个,第二模态信息为图像信息和文本信息中的另一个,其中,管理装置900还包括:入库主体检测模块,被配置为:使用主体检测模块,从入库信息的图像信息中,提取入库信息的一条或多条主体信息;以及入库特征提取模块,被配置为:使用第一多模态特征提取模块、第二多模态特征提取模块和图像特征提取模块,从入库信息的一条或多条主体信息中提取入库信息的单模态图像特征,并且,其中,检索对象生成模块903包括:检索对象生成子模块,被配置为:基于入库信息的第一模态特征、第二模态特征、多模态特征和单模态图像特征,生成多模态信息库中对应于入库信息的一个或多个检索对象。According to some embodiments, the first modality information is any one of image information and text information, and the second modality information is the other one of image information and text information, wherein the management device 900 further includes: a storage subject detection module , is configured to: use the subject detection module to extract one or more pieces of subject information of the storage information from the image information of the storage information; and the storage feature extraction module is configured to: use the first multi-modal feature extraction module, the second multimodal feature extraction module and the image feature extraction module extract the single-modal image features of the storage information from one or more pieces of subject information of the storage information, and, wherein, the retrieval object generation module 903 includes: The retrieval object generation sub-module is configured to: generate the corresponding information in the multimodal information base based on the first modal feature, the second modal feature, the multimodal feature and the single modal image feature of the warehousing information One or more search objects for .
According to some embodiments, the first multimodal feature extraction module is one of a multimodal image extraction module and a multimodal text extraction module, and the second multimodal feature extraction module is the other of the multimodal image extraction module and the multimodal text extraction module. The storage feature extraction module includes: a storage image extraction module configured to, for each of the image information of the storage information and the one or more pieces of subject information, extract a multimodal image feature of that information using the multimodal image extraction module; a storage text extraction module configured to extract a multimodal text feature of the storage information from the text information of the storage information using the multimodal text extraction module; a storage subject selection module configured to, for each of the image information of the storage information and the one or more pieces of subject information, calculate the similarity between the multimodal image feature of that information and the multimodal text feature of the storage information as the similarity score of that information, and to select, from the image information of the storage information and the one or more pieces of subject information, the information with the highest similarity score; and a storage single-modality extraction module configured to extract the single-modality image feature of the storage information from the information with the highest similarity score using the image feature extraction module.
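The selection step performed by the subject selection module can be sketched as follows. This is an illustrative sketch only: it assumes cosine similarity as the similarity measure (the disclosure does not name one here), and `cosine_similarity`, `select_best_candidate`, and the sample vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_best_candidate(
    candidate_features: List[Sequence[float]],
    text_feature: Sequence[float],
) -> Tuple[int, float]:
    """Return (index, score) of the candidate most similar to the text feature.

    The candidates are the multimodal image features of the whole image and of
    each detected subject; the text feature is the multimodal text feature.
    """
    if not candidate_features:
        raise ValueError("at least one candidate feature is required")
    scores = [cosine_similarity(f, text_feature) for f in candidate_features]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores[best_index]


candidates = [
    [1.0, 0.0],  # multimodal image feature of the whole image
    [0.6, 0.8],  # multimodal image feature of subject 1
    [0.0, 1.0],  # multimodal image feature of subject 2
]
text = [0.0, 1.0]  # multimodal text feature
best_index, best_score = select_best_candidate(candidates, text)
```

In this toy input, subject 2 aligns exactly with the text feature and is therefore selected; the single-modality image feature would then be extracted from that region rather than from the whole image.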
According to some embodiments, the multimodal information generation module 902 includes: a storage product calculation module configured to, for each of the first modality feature and the second modality feature of the storage information, multiply that modality feature by its corresponding weight to obtain the product corresponding to that modality feature; and a storage normalization module configured to normalize the sum of the products corresponding to the first modality feature and the second modality feature of the storage information to obtain the multimodal feature of the retrieval information.
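The weighted fusion described above can be sketched in a few lines of Python. The sketch assumes L2 normalization and equal default weights; the disclosure specifies only "weights" and "normalization", so both choices, as well as the name `fuse_features`, are illustrative assumptions.

```python
import math
from typing import List, Sequence


def fuse_features(
    first: Sequence[float],
    second: Sequence[float],
    w_first: float = 0.5,
    w_second: float = 0.5,
) -> List[float]:
    """Weight each modality feature, sum the products, then L2-normalize."""
    if len(first) != len(second):
        raise ValueError("feature vectors must have the same dimension")
    # Product of each modality feature with its weight, summed element-wise.
    summed = [w_first * a + w_second * b for a, b in zip(first, second)]
    norm = math.sqrt(sum(x * x for x in summed))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in summed]


image_feature = [1.0, 0.0]
text_feature = [0.0, 1.0]
fused = fuse_features(image_feature, text_feature)
```

Normalizing the fused vector to unit length means that a later inner product between two fused features equals their cosine similarity, which is one plausible reason for the normalization step.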
In the technical solutions of the present disclosure, the acquisition, storage, and use of any user personal information involved comply with relevant laws and regulations and do not violate public order and good morals.
According to embodiments of the present disclosure, an electronic device, a readable storage medium, and a computer program product are further provided.
According to some embodiments, the present disclosure provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method described in the present disclosure.
According to some embodiments, the present disclosure provides a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method described in the present disclosure.
According to some embodiments, the present disclosure provides a computer program product including a computer program, wherein the computer program, when executed by a processor, implements the method described in the present disclosure.
Referring to FIG. 10, a structural block diagram of an electronic device 1000 that can serve as a server or a client of the present disclosure will now be described; it is an example of a hardware device that can be applied to aspects of the present disclosure. The term electronic device is intended to represent various forms of digital electronic computer equipment, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. An electronic device may also represent various forms of mobile apparatus, such as personal digital assistants, cellular phones, smartphones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit the implementations of the present disclosure described and/or claimed herein.
As shown in FIG. 10, the device 1000 includes a computing unit 1001, which can perform various appropriate actions and processing according to a computer program stored in a read-only memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a random access memory (RAM) 1003. The RAM 1003 can also store various programs and data required for the operation of the device 1000. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to one another via a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
Multiple components of the device 1000 are connected to the I/O interface 1005, including an input unit 1006, an output unit 1007, a storage unit 1008, and a communication unit 1009. The input unit 1006 may be any type of device capable of inputting information to the device 1000; it can receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touchscreen, a trackpad, a trackball, a joystick, a microphone, and/or a remote control. The output unit 1007 may be any type of device capable of presenting information and may include, but is not limited to, a display, a speaker, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 1008 may include, but is not limited to, a magnetic disk and an optical disc. The communication unit 1009 allows the device 1000 to exchange information/data with other devices over a computer network such as the Internet and/or various telecommunication networks, and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication transceiver, and/or a chipset, such as a Bluetooth™ device, an 802.11 device, a WiFi device, a WiMax device, a cellular communication device, and/or the like.
The computing unit 1001 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 1001 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, microcontroller, and the like. The computing unit 1001 performs the methods and processing described above, for example methods 200, 300, 400, 500, and/or 600. For example, in some embodiments, methods 200, 300, 400, 500, and/or 600 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1000 via the ROM 1002 and/or the communication unit 1009. When the computer program is loaded into the RAM 1003 and executed by the computing unit 1001, one or more steps of methods 200, 300, 400, 500, and/or 600 described above may be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured to perform methods 200, 300, 400, 500, and/or 600 in any other appropriate manner (for example, by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input apparatus, and at least one output apparatus, and transmit data and instructions to the storage system, the at least one input apparatus, and the at least one output apparatus.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be carried out. The program code may execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine, or entirely on a remote machine or server.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by, or in connection with, an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide interaction with a user, the systems and techniques described herein may be implemented on a computer that has a display apparatus for displaying information to the user (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor), and a keyboard and a pointing apparatus (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of apparatus may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (for example, visual, auditory, or tactile feedback), and input from the user may be received in any form (including acoustic, speech, or tactile input).
The systems and techniques described herein may be implemented in a computing system that includes a back-end component (for example, a data server), or a computing system that includes a middleware component (for example, an application server), or a computing system that includes a front-end component (for example, a user computer with a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by digital data communication in any form or medium (for example, a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.
A computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship with each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present disclosure may be performed in parallel, sequentially, or in a different order, as long as the desired result of the technical solutions disclosed herein can be achieved; no limitation is imposed herein.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it should be understood that the methods, systems, and devices described above are merely exemplary embodiments or examples, and that the scope of the present invention is not limited by these embodiments or examples but is defined only by the granted claims and their equivalents. Various elements of the embodiments or examples may be omitted or replaced by equivalent elements. In addition, the steps may be performed in an order different from that described in the present disclosure. Further, various elements of the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many of the elements described herein may be replaced by equivalent elements that appear after the present disclosure.

Claims (27)

  1. A retrieval method for a multimodal information base, wherein the multimodal information base includes a plurality of pieces of target information, each piece of target information including first modality information and second modality information, the method comprising:
    in response to receiving retrieval information including the first modality information, extracting a first modality feature of the retrieval information from the first modality information of the retrieval information using a first multimodal feature extraction module;
    selecting a first group of target information from the plurality of pieces of target information based on the similarity between the first modality feature of the retrieval information and each of the first modality feature and the second modality feature of each piece of target information in the plurality of pieces of target information, wherein the first modality feature of each piece of target information is extracted from the first modality information of that target information using the first multimodal feature extraction module, and the second modality feature of each piece of target information is extracted from the second modality information of that target information using a second multimodal feature extraction module; and
    generating a retrieval result based on the first group of target information.
  2. The method according to claim 1, wherein selecting the first group of target information from the plurality of pieces of target information comprises:
    selecting a second group of target information from the plurality of pieces of target information based on the similarity between the first modality feature of the retrieval information and the first modality feature of each piece of target information in the plurality of pieces of target information; and
    selecting the first group of target information from the second group of target information based on the similarity between the first modality feature of the retrieval information and the second modality feature of each piece of target information in the second group of target information.
  3. The method according to claim 1, wherein the first modality information is image information,
    wherein the method further comprises, before selecting the first group of target information from the plurality of pieces of target information:
    extracting one or more pieces of subject information from the first modality information of the retrieval information using a subject detection module;
    for each piece of subject information, extracting a single-modality image feature of that subject information from the subject information using an image feature extraction module; and
    selecting a third group of target information from the plurality of pieces of target information based on the similarity between the single-modality image features of the one or more pieces of subject information and the single-modality image feature of each piece of target information in the plurality of pieces of target information, wherein the single-modality image feature of each piece of target information is extracted from the first modality information of that target information using the image feature extraction module, and
    wherein selecting the first group of target information from the plurality of pieces of target information comprises:
    for each piece of target information in the third group of target information, calculating a similarity score of that target information based on the similarity between the first modality feature of the retrieval information and the first modality feature of that target information and the similarity between the first modality feature of the retrieval information and the second modality feature of that target information; and
    selecting the first group of target information from the third group of target information based on the similarity score of each piece of target information in the third group of target information.
  4. The method according to claim 3, wherein the one or more pieces of subject information include a plurality of pieces of subject information, and wherein selecting the third group of target information from the plurality of pieces of target information comprises:
    for each piece of subject information, selecting, from the plurality of pieces of target information, a plurality of pieces of target information corresponding to that subject information based on the similarity between the single-modality image feature of that subject information and the single-modality image feature of each piece of target information in the plurality of pieces of target information; and
    selecting the plurality of pieces of target information corresponding to each piece of subject information as the third group of target information.
  5. The method according to claim 1, further comprising, before generating the retrieval result based on the first group of target information:
    in response to receiving retrieval information including the first modality information and the second modality information, extracting the first modality feature of the retrieval information using the first multimodal feature extraction module, and extracting a second modality feature of the retrieval information using the second multimodal feature extraction module;
    generating a multimodal feature of the retrieval information based on the first modality feature and the second modality feature of the retrieval information; and
    selecting the first group of target information from the plurality of pieces of target information based on the similarity between the multimodal feature of the retrieval information and a multimodal feature of each piece of target information in the plurality of pieces of target information, wherein the multimodal feature of each piece of target information is generated based on the first modality feature and the second modality feature of that target information.
  6. The method according to claim 5, wherein generating the multimodal feature of the retrieval information comprises:
    for each of the first modality feature and the second modality feature of the retrieval information, multiplying that modality feature by its corresponding weight to obtain the product corresponding to that modality feature; and normalizing the sum of the products corresponding to the first modality feature and the second modality feature of the retrieval information to obtain the multimodal feature of the retrieval information.
  7. The method according to any one of claims 1-6, wherein the first multimodal feature extraction module and the second multimodal feature extraction module are obtained by training based on a loss function, wherein the loss function is a function of the similarity between the features extracted respectively by the first multimodal feature extraction module and the second multimodal feature extraction module.
  8. The method according to any one of claims 1-6, wherein the first modality information is one of image information and text information, and the second modality information is the other of image information and text information.
  9. A management method for a multimodal information base, the method comprising:
    in response to receiving storage information including first modality information and second modality information, extracting a first modality feature of the storage information from the first modality information of the storage information using a first multimodal feature extraction module, and extracting a second modality feature of the storage information from the second modality information of the storage information using a second multimodal feature extraction module;
    calculating a multimodal feature of the storage information based on the first modality feature and the second modality feature of the storage information;
    generating, based on the first modality feature, the second modality feature, and the multimodal feature of the storage information, one or more retrieval objects in the multimodal information base that correspond to the storage information; and
    in response to receiving retrieval information, performing the retrieval method according to any one of claims 1-8.
  10. The method according to claim 9, wherein the first modality information is one of image information and text information, and the second modality information is the other of image information and text information,
    wherein the method further comprises, before generating the one or more retrieval objects in the multimodal information base that correspond to the storage information:
    extracting one or more pieces of subject information of the storage information from the image information of the storage information using a subject detection module; and
    extracting a single-modality image feature of the storage information from the one or more pieces of subject information of the storage information using the first multimodal feature extraction module, the second multimodal feature extraction module, and an image feature extraction module, and
    wherein generating the one or more retrieval objects in the multimodal information base that correspond to the storage information comprises:
    generating, based on the first modality feature, the second modality feature, the multimodal feature, and the single-modality image feature of the storage information, the one or more retrieval objects in the multimodal information base that correspond to the storage information.
  11. The method according to claim 10, wherein the first multimodal feature extraction module is one of a multimodal image extraction module and a multimodal text extraction module, and the second multimodal feature extraction module is the other of the multimodal image extraction module and the multimodal text extraction module, and
    wherein extracting the single-modality image feature of the storage information from the one or more pieces of subject information of the storage information comprises:
    for each of the image information of the storage information and the one or more pieces of subject information, extracting a multimodal image feature of that information using the multimodal image extraction module;
    extracting a multimodal text feature of the storage information from the text information of the storage information using the multimodal text extraction module;
    for each of the image information of the storage information and the one or more pieces of subject information, calculating the similarity between the multimodal image feature of that information and the multimodal text feature of the storage information as the similarity score of that information;
    selecting, from the image information of the storage information and the one or more pieces of subject information, the information with the highest similarity score; and
    extracting the single-modality image feature of the storage information from the information with the highest similarity score using the image feature extraction module.
  12. The method according to claim 9, wherein calculating the multimodal feature of the storage information comprises:
    for each of the first modality feature and the second modality feature of the storage information, multiplying that modality feature by its corresponding weight to obtain the product corresponding to that modality feature; and normalizing the sum of the products corresponding to the first modality feature and the second modality feature of the storage information to obtain the multimodal feature of the retrieval information.
  13. A retrieval apparatus for a multimodal information base, wherein the multimodal information base comprises a plurality of pieces of target information, each piece of target information comprising first modality information and second modality information, the apparatus comprising:
    a retrieval feature extraction module configured to: in response to receiving retrieval information comprising the first modality information, extract a first modal feature of the retrieval information from the first modality information of the retrieval information using a first multimodal feature extraction module;
    a target matching module configured to: select a first group of target information from the plurality of pieces of target information based on a similarity between the first modal feature of the retrieval information and each of a first modal feature and a second modal feature of each piece of target information, wherein the first modal feature of each piece of target information is extracted from the first modality information of that target information using the first multimodal feature extraction module, and the second modal feature of each piece of target information is extracted from the second modality information of that target information using a second multimodal feature extraction module; and
    a retrieval result generation module configured to: generate a retrieval result based on the first group of target information.
  14. The apparatus according to claim 13, wherein the target matching module comprises:
    a second target information selection module configured to: select a second group of target information from the plurality of pieces of target information based on a similarity between the first modal feature of the retrieval information and the first modal feature of each piece of target information; and
    a first target information selection module configured to: select the first group of target information from the second group of target information based on a similarity between the first modal feature of the retrieval information and the second modal feature of each piece of target information in the second group of target information.
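The two-stage selection in this claim (shortlist by same-modality similarity, then pick from the shortlist by cross-modality similarity) might look like the sketch below. The `Target` record, the top-k cut-offs `k2` and `k1`, and the use of a dot product on features assumed to be L2-normalized are all illustrative assumptions; the claim only requires selection "based on a similarity".

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Target:
    name: str
    first_feature: Sequence[float]   # e.g. multimodal image feature
    second_feature: Sequence[float]  # e.g. multimodal text feature


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def two_stage_select(query: Sequence[float], targets: List[Target],
                     k2: int, k1: int) -> List[Target]:
    """Stage 1: keep the k2 targets closest in the first modality.
    Stage 2: from those, keep the k1 targets closest in the second modality."""
    second_group = sorted(targets, key=lambda t: dot(query, t.first_feature),
                          reverse=True)[:k2]
    first_group = sorted(second_group, key=lambda t: dot(query, t.second_feature),
                         reverse=True)[:k1]
    return first_group


# Hypothetical information base with three entries.
base = [
    Target("a", [1.0, 0.0], [0.0, 1.0]),
    Target("b", [0.8, 0.6], [1.0, 0.0]),
    Target("c", [0.0, 1.0], [1.0, 0.0]),
]
result = two_stage_select([1.0, 0.0], base, k2=2, k1=1)
```

In this example entry "c" matches the query well in the second modality but is never considered, because it was filtered out in the first stage; that is the intended effect of selecting the first group only from the second group.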
  15. The apparatus according to claim 13, wherein the first modality information is image information,
    wherein the apparatus further comprises a subject feature extraction module comprising:
    a subject detection module configured to: extract one or more pieces of subject information from the first modality information of the retrieval information using a subject detection module;
    a single-modal image feature extraction module configured to: for each piece of subject information, extract a single-modal image feature of that subject information from the subject information using an image feature extraction module; and
    a third target information selection module configured to: select a third group of target information from the plurality of pieces of target information based on a similarity between the single-modal image features of the one or more pieces of subject information and a single-modal image feature of each piece of target information, wherein the single-modal image feature of each piece of target information is extracted from the first modality information of that target information using the image feature extraction module, and
    wherein the target matching module comprises:
    a similarity calculation module configured to: for each piece of target information in the third group of target information, calculate a similarity score of that target information based on a similarity between the first modal feature of the retrieval information and the first modal feature of that target information and a similarity between the first modal feature of the retrieval information and the second modal feature of that target information; and
    a first target information selection module configured to: select the first group of target information from the third group of target information based on the similarity score of each piece of target information in the third group of target information.
  16. The apparatus according to claim 15, wherein the one or more pieces of subject information comprise a plurality of pieces of subject information, and wherein the first target information selection module comprises:
    a subject information matching module configured to:
    for each piece of subject information, select, from the plurality of pieces of target information, a plurality of pieces of target information corresponding to that subject information based on a similarity between the single-modal image feature of that subject information and the single-modal image feature of each piece of target information; and
    select the plurality of pieces of target information corresponding to each piece of subject information as the third group of target information.
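The per-subject matching in this claim can be sketched as a top-k lookup for each detected subject, with the per-subject results merged into one candidate group. The top-k rule, the removal of duplicates, the dot-product similarity, and the function name are assumptions for illustration; the claim does not prescribe them.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def third_group_indices(subject_features: List[Sequence[float]],
                        target_features: List[Sequence[float]],
                        k: int) -> List[int]:
    """For each subject, take the k most similar targets (by single-modal
    image feature) and merge the results, keeping first-seen order."""
    selected: List[int] = []
    seen = set()
    for subject in subject_features:
        ranked = sorted(range(len(target_features)),
                        key=lambda i: dot(subject, target_features[i]),
                        reverse=True)
        for i in ranked[:k]:
            if i not in seen:  # assumption: a target is kept only once
                seen.add(i)
                selected.append(i)
    return selected


# Two hypothetical subjects detected in the query image, four stored targets.
subjects = [[1.0, 0.0], [0.0, 1.0]]
stored = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
group = third_group_indices(subjects, stored, k=2)
```

Matching each subject separately lets a query image that shows several objects retrieve candidates for every object, rather than only for the dominant one.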
  17. The apparatus according to claim 13, further comprising a multimodal feature retrieval module comprising:
    a multimodal sub-feature extraction module configured to: in response to receiving retrieval information comprising the first modality information and the second modality information, extract the first modal feature of the retrieval information using the first multimodal feature extraction module, and extract a second modal feature of the retrieval information using the second multimodal feature extraction module;
    a multimodal feature generation module configured to: generate a multimodal feature of the retrieval information based on the first modal feature and the second modal feature of the retrieval information; and
    a first target information selection module configured to: select the first group of target information from the plurality of pieces of target information based on a similarity between the multimodal feature of the retrieval information and a multimodal feature of each piece of target information, wherein the multimodal feature of each piece of target information is generated based on the first modal feature and the second modal feature of that target information.
  18. The apparatus according to claim 17, wherein the multimodal feature generation module comprises:
    a product calculation module configured to: for each of the first modal feature and the second modal feature of the retrieval information, multiply the modal feature by a weight corresponding to the modal feature to obtain a product corresponding to the modal feature; and
    a normalization module configured to: normalize the sum of the products corresponding to the first modal feature and the second modal feature of the retrieval information to obtain the multimodal feature of the retrieval information.
  19. The apparatus according to any one of claims 13-18, wherein the first multimodal feature extraction module and the second multimodal feature extraction module are obtained by training based on a loss function, wherein the loss function is a function of a similarity between features respectively extracted by the first multimodal feature extraction module and the second multimodal feature extraction module.
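The claim states only that the loss is a function of the similarity between the features produced by the two extraction modules. One common loss of that kind is a contrastive (InfoNCE-style) loss over matched image/text pairs, sketched below. This particular loss, the temperature value, and the dot-product similarity are assumptions chosen for illustration and are not taken from the claim.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(image_features: List[Sequence[float]],
                     text_features: List[Sequence[float]],
                     temperature: float = 0.07) -> float:
    """Average cross-entropy of picking the matching text for each image,
    where logits are similarities between the two extractors' features.
    Pair i of image_features and text_features is assumed to match."""
    n = len(image_features)
    total = 0.0
    for i in range(n):
        logits = [dot(image_features[i], text_features[j]) / temperature
                  for j in range(n)]
        m = max(logits)  # subtract the maximum for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum_exp - logits[i]
    return total / n


# Hypothetical unit-length features for two image/text pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
misaligned = contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

Under this assumed loss, training pushes the feature of an image toward the feature of its own text and away from the features of other texts, which is what allows an image query to be compared directly against stored text features.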
  20. The apparatus according to any one of claims 13-18, wherein the first modality information is one of image information and text information, and the second modality information is the other of image information and text information.
  21. A management apparatus for a multimodal information base, comprising:
    a storage information extraction module configured to: in response to receiving storage information comprising first modality information and second modality information, extract a first modal feature of the storage information from the first modality information of the storage information using a first multimodal feature extraction module, and extract a second modal feature of the storage information from the second modality information of the storage information using a second multimodal feature extraction module;
    a multimodal information generation module configured to: calculate a multimodal feature of the storage information based on the first modal feature and the second modal feature of the storage information;
    a retrieval object generation module configured to: generate, in the multimodal information base, one or more retrieval objects corresponding to the storage information based on the first modal feature, the second modal feature, and the multimodal feature of the storage information; and
    the retrieval apparatus for a multimodal information base according to any one of claims 13-20.
  22. The apparatus according to claim 21, wherein the first modality information is one of image information and text information, and the second modality information is the other of image information and text information,
    wherein the apparatus further comprises:
    a storage subject detection module configured to: extract one or more pieces of subject information of the storage information from the image information of the storage information using a subject detection module; and
    a storage feature extraction module configured to: extract a single-modal image feature of the storage information from the one or more pieces of subject information of the storage information using the first multimodal feature extraction module, the second multimodal feature extraction module, and an image feature extraction module, and
    wherein the retrieval object generation module comprises:
    a retrieval object generation submodule configured to: generate, in the multimodal information base, one or more retrieval objects corresponding to the storage information based on the first modal feature, the second modal feature, the multimodal feature, and the single-modal image feature of the storage information.
  23. The apparatus according to claim 22, wherein the first multimodal feature extraction module is one of a multimodal image extraction module and a multimodal text extraction module, and the second multimodal feature extraction module is the other of the multimodal image extraction module and the multimodal text extraction module, and
    wherein the storage feature extraction module comprises:
    a storage image extraction module configured to: for each of the image information and the one or more pieces of subject information of the storage information, extract a multimodal image feature of that information using the multimodal image extraction module;
    a storage text extraction module configured to: extract a multimodal text feature of the storage information from the text information of the storage information using the multimodal text extraction module;
    a storage subject selection module configured to:
    for each of the image information and the one or more pieces of subject information of the storage information, calculate a similarity between the multimodal image feature of that information and the multimodal text feature of the storage information, as a similarity score of that information; and
    select, from the image information and the one or more pieces of subject information of the storage information, the information having the highest similarity score; and
    a storage single-modal extraction module configured to: extract the single-modal image feature of the storage information from the information having the highest similarity score using the image feature extraction module.
  24. The apparatus according to claim 21, wherein the multimodal information generation module comprises:
    a storage product calculation module configured to: for each of the first modal feature and the second modal feature of the storage information, multiply the modal feature by a weight corresponding to the modal feature to obtain a product corresponding to the modal feature; and
    a storage normalization module configured to: normalize the sum of the products corresponding to the first modal feature and the second modal feature of the storage information to obtain the multimodal feature of the retrieval information.
  25. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method according to any one of claims 1-12.
  26. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause the computer to perform the method according to any one of claims 1-12.
  27. A computer program product, comprising a computer program, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1-12.
PCT/CN2022/082949 2021-08-19 2022-03-25 Retrieval method, management method, and apparatuses for multimodal information base, device, and medium WO2023019948A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110955328.9 2021-08-19
CN202110955328.9A CN113656668B (en) 2021-08-19 2021-08-19 Retrieval method, management method, device, equipment and medium of multi-modal information base

Publications (1)

Publication Number Publication Date
WO2023019948A1 (en) 2023-02-23

Family

ID=78481330

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/082949 WO2023019948A1 (en) 2021-08-19 2022-03-25 Retrieval method, management method, and apparatuses for multimodal information base, device, and medium

Country Status (2)

Country Link
CN (1) CN113656668B (en)
WO (1) WO2023019948A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113656668B (en) * 2021-08-19 2022-10-11 北京百度网讯科技有限公司 Retrieval method, management method, device, equipment and medium of multi-modal information base
CN114782719B (en) * 2022-04-26 2023-02-03 北京百度网讯科技有限公司 Training method of feature extraction model, object retrieval method and device
CN114661936B (en) * 2022-05-19 2022-10-14 中山大学深圳研究院 Image retrieval method applied to industrial vision and electronic equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011203776A (en) * 2010-03-24 2011-10-13 Yahoo Japan Corp Similar image retrieval device, method, and program
CN109783655A (en) * 2018-12-07 2019-05-21 西安电子科技大学 A kind of cross-module state search method, device, computer equipment and storage medium
WO2020122456A1 (en) * 2018-12-12 2020-06-18 주식회사 인공지능연구원 System and method for matching similarities between images and texts
CN111949814A (en) * 2020-06-24 2020-11-17 百度在线网络技术(北京)有限公司 Searching method, searching device, electronic equipment and storage medium
CN112015923A (en) * 2020-09-04 2020-12-01 平安科技(深圳)有限公司 Multi-mode data retrieval method, system, terminal and storage medium
CN113656668A (en) * 2021-08-19 2021-11-16 北京百度网讯科技有限公司 Retrieval method, management method, device, equipment and medium of multi-modal information base

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114969417B (en) * 2020-09-23 2023-04-11 华为技术有限公司 Image reordering method, related device and computer readable storage medium
CN113076433B (en) * 2021-04-26 2022-05-17 支付宝(杭州)信息技术有限公司 Retrieval method and device for retrieval object with multi-modal information

Also Published As

Publication number Publication date
CN113656668A (en) 2021-11-16
CN113656668B (en) 2022-10-11

Similar Documents

Publication Publication Date Title
WO2023019948A1 (en) Retrieval method, management method, and apparatuses for multimodal information base, device, and medium
CN112749758B (en) Image processing method, neural network training method, device, equipment and medium
WO2022141968A1 (en) Object recommendation method and apparatus, computer device, and medium
US20230052389A1 (en) Human-object interaction detection
WO2023221422A1 (en) Neural network used for text recognition, training method thereof and text recognition method
WO2023142406A1 (en) Ranking method and apparatus, ranking model training method and apparatus, and electronic device and medium
US20220237376A1 (en) Method, apparatus, electronic device and storage medium for text classification
US20230051232A1 (en) Human-object interaction detection
US20230047628A1 (en) Human-object interaction detection
WO2023245938A1 (en) Object recommendation method and apparatus
KR20230006601A (en) Alignment methods, training methods for alignment models, devices, electronic devices and media
WO2024027125A1 (en) Object recommendation method and apparatus, electronic device, and storage medium
US20230245643A1 (en) Data processing method
WO2023050732A1 (en) Object recommendation method and device
WO2023240833A1 (en) Information recommendation method and apparatus, electronic device, and medium
US20220004801A1 (en) Image processing and training for a neural network
CN112860681B (en) Data cleaning method and device, computer equipment and medium
CN112905743B (en) Text object detection method, device, electronic equipment and storage medium
CN114494797A (en) Method and apparatus for training image detection model
CN114547252A (en) Text recognition method and device, electronic equipment and medium
CN114780846A (en) Ranking model training method, device, medium and equipment of information retrieval system
CN114238745A (en) Method and device for providing search result, electronic equipment and medium
CN115809364B (en) Object recommendation method and model training method
US20230044508A1 (en) Data labeling processing
CN114140851B (en) Image detection method and method for training image detection model

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22857267

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE