WO2022052630A1 - Multimedia information processing method, apparatus, electronic device, and storage medium

Info

Publication number
WO2022052630A1
Authority
WO
WIPO (PCT)
Prior art keywords
multimedia information
audio
information
target
information processing
Prior art date
Application number
PCT/CN2021/107117
Other languages
English (en)
French (fr)
Inventor
杨喻茸
徐叙远
龚国平
方杨
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Priority to EP21865686.6A (EP4114012A4)
Publication of WO2022052630A1
Priority to US17/962,722 (US11887619B2)

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23418Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/50Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols using hash chains, e.g. blockchains or hash trees
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/64Protecting data integrity, e.g. using checksums, certificates or signatures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241Advertisements
    • G06Q30/0251Targeted advertisements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0282Rating or review of business operators or products
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/04Trading; Exchange, e.g. stocks, commodities, derivatives or currency exchange
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01Social networking
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating
    • G10L21/0308Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/1095Replication or mirroring of data, e.g. scheduling or transport for data synchronisation between network nodes
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/50Network services
    • H04L67/55Push-based network services
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/32Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials
    • H04L9/3247Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials involving digital signatures
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/233Processing of audio elementary streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/25Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N21/254Management at additional data server, e.g. shopping server, rights management server
    • H04N21/2541Rights Management
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/27Server based end-user applications
    • H04N21/274Storing end-user multimedia data in response to end-user request, e.g. network recorder
    • H04N21/2743Video hosting of uploaded data from client
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/466Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • H04N21/4668Learning process for intelligent management, e.g. learning user preferences for recommending movies for recommending content, e.g. movies
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q2220/00Business processing using cryptography

Definitions

  • the present application relates to multimedia information processing technology, and in particular, to a multimedia information processing method, apparatus, electronic device, and storage medium.
  • multimedia information comes in various forms, the demand for multimedia information shows explosive growth, and the quantity and types of multimedia information are also increasing.
  • An embodiment of the present application provides a method for processing multimedia information, which is executed by an electronic device, and the method includes:
  • based on the mel spectrogram corresponding to the audio, determining the audio feature vector corresponding to the audio
  • the similarity between the target multimedia information and the source multimedia information is determined based on the audio feature vector corresponding to the source audio in the source multimedia information and the audio feature vector corresponding to the target audio in the target multimedia information.
  • the embodiment of the present application also provides a multimedia information processing device, and the device includes:
  • an information transmission module configured to parse the multimedia information to separate the audio in the multimedia information
  • an information processing module configured to:
  • based on the mel spectrogram corresponding to the audio, determine the audio feature vector corresponding to the audio
  • the similarity between the target multimedia information and the source multimedia information is determined based on the audio feature vector corresponding to the source audio in the source multimedia information and the audio feature vector corresponding to the target audio in the target multimedia information.
  • the embodiment of the present application also provides an electronic device, the electronic device includes:
  • the processor is configured to implement the multimedia information processing method provided by the embodiments of the present application when executing the executable instructions stored in the memory.
  • Embodiments of the present application further provide a computer-readable storage medium storing executable instructions, and when the executable instructions are executed by a processor, the multimedia information processing method provided by the embodiments of the present application is implemented.
  • FIG. 1 is a schematic diagram of a usage environment of a multimedia information processing method provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of the composition and structure of an electronic device provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of video over-cropping according to an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of a method for processing multimedia information provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a processing process of a multimedia information processing model in an embodiment of the present application.
  • FIG. 6 is a schematic flowchart of similarity identification provided by an embodiment of the present application.
  • FIG. 7 is a schematic flowchart of training a multimedia information processing model provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of the effect of iterative processing in the implementation of the application.
  • FIG. 9 is a schematic diagram of the architecture of a blockchain network provided by an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a blockchain in a blockchain network provided by an embodiment of the present application.
  • FIG. 11 is a schematic diagram of a functional architecture of a blockchain network provided by an embodiment of the present application.
  • FIG. 12 is a schematic diagram of a usage scenario of the multimedia information processing method provided by the embodiment of the present application.
  • FIG. 13 is a schematic diagram of a process of using the multimedia information processing method according to the embodiment of the present application.
  • One or more of the executed operations may be performed in real time or with a set delay. Unless otherwise specified, there is no restriction on the order in which multiple operations are executed.
  • Multimedia information: generally refers to various forms of multimedia information available on the Internet.
  • The multimedia information may include at least one of text, sound, and image, but is of course not limited thereto.
  • For example, multimedia information can be long videos (such as videos uploaded by users whose duration is greater than or equal to 1 minute), short videos (such as videos uploaded by users whose duration is less than 1 minute), or audio, such as music videos (Music Video, MV) with a fixed picture, or records.
  • APP: the carrier that implements specific functions in the terminal.
  • For example, the mobile client is the carrier of specific functions in the mobile terminal, such as the function of performing online live broadcast (video streaming) or the function of playing online video.
  • Short-Time Fourier Transform: a mathematical transform related to the Fourier transform, used to determine the frequency and phase of the sine waves in a local section of a time-varying signal.
  • Information flow: a form of content organization arranged from top to bottom according to a specific specification style. From the perspective of display sorting, time sorting, popularity sorting, or algorithm sorting can be applied.
  • Audio feature vector: that is, the audio 01 vector, a binarized feature vector generated based on audio.
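As a rough illustration of what a binarized ("01") audio feature vector looks like and how two such vectors might be compared, the sketch below thresholds real-valued features into 0/1 values and measures the fraction of matching positions. The thresholding rule, the agreement-based similarity, the function names, and the sample numbers are all assumptions made for illustration; this section does not specify how the vector is produced or compared.

```python
from typing import List, Sequence


def binarize(values: Sequence[float], threshold: float = 0.0) -> List[int]:
    """Turn real-valued features into a 0/1 vector (the threshold rule is an assumption)."""
    return [1 if v > threshold else 0 for v in values]


def hamming_similarity(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of positions at which two equal-length 0/1 vectors agree."""
    if len(a) != len(b) or not a:
        raise ValueError("vectors must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical feature values standing in for source and target audio.
source_vec = binarize([0.7, -0.2, 0.1, -0.9])
target_vec = binarize([0.5, -0.1, -0.3, -0.4])
similarity = hamming_similarity(source_vec, target_vec)
print(source_vec, target_vec, similarity)
```

Binary vectors of this kind are cheap to store and compare, which is one plausible reason to binarize features when a large library of multimedia information has to be searched.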
  • FIG. 1 is a schematic diagram of a usage environment of the multimedia information processing method provided by an embodiment of the present application.
  • terminals such as the terminal 10-1 and the terminal 10-2 are provided with clients capable of performing different functions, wherein the terminals ( For example, the terminal 10-1 and the terminal 10-2) can use the business process in the client to obtain different multimedia information from the corresponding server 200 through the network 300 for browsing.
  • the network 300 can be a wide area network or a local area network, or both A combination of data transmission using a wireless link.
  • the types of multimedia information acquired by the terminals (such as the terminals 10-1 and 10-2) from the corresponding servers 200 through the network 300 are not limited, for example, including but not limited to: long videos (for example, videos uploaded by users) A video with a duration greater than or equal to 1 minute, or an existing video for which the user needs to perform copyright verification), short video (such as a video uploaded by the user with a duration of less than 1 minute), and audio (such as an MV or album with a fixed image).
  • The terminals can either obtain long videos (that is, videos carrying video information or corresponding video links) from the corresponding server 200 through the network 300, or obtain short videos from the corresponding server 400 through the network 300 for browsing, using the same video client or a WeChat applet.
  • Different types of multimedia information may be stored in the server 200 and the server 400 .
  • the present application does not differentiate the playback environments of different types of multimedia information.
  • The multimedia information pushed to the user's client should be copyright-compliant. Therefore, for a large amount of multimedia information, it is necessary to determine which multimedia information is similar, and then check the copyright information of the similar multimedia information for compliance, to avoid pushing duplicate or infringing multimedia information.
  • the embodiments of the present application can be applied to short video playback.
  • In short video playback, different short videos from different data sources are usually processed, and the corresponding data is finally presented on a user interface (UI).
  • If the recommended video is a pirated video whose copyright is not compliant, it will have a negative impact on the user experience.
  • The background database used for video playback receives a large amount of video data from different sources every day, and the different videos obtained for multimedia information recommendation to target users can also be called by other applications (for example, the recommendation results of the short video recommendation process are migrated to a long video recommendation process or a news recommendation process). Of course, the multimedia information processing model that matches the corresponding target user can also be migrated to different multimedia information recommendation processes (such as a webpage multimedia information recommendation process, an applet multimedia information recommendation process, or the multimedia information recommendation process of a long video client).
  • the multimedia information processing method provided by the embodiments of the present application may be implemented by a terminal.
  • the terminals such as the terminal 10-1 and the terminal 10-2 can locally implement the multimedia information processing solution.
  • the multimedia information processing method provided by the embodiments of the present application may be implemented by a server.
  • the server 200 may implement a scheme of multimedia information processing.
  • the multimedia information processing method provided by the embodiments of the present application may be implemented by a terminal and a server collaboratively.
  • the terminals such as the terminal 10-1 and the terminal 10-2
  • the server 200 may send the finally obtained multimedia information to be recommended to the terminal, so as to perform multimedia information recommendation.
  • The electronic device can be implemented in various forms. It may be a terminal with a multimedia information processing function, such as a mobile phone running a video client, in which case the trained multimedia information processing model can be encapsulated in the storage medium of the terminal. It may also be a server or server group with a multimedia information processing function, in which case the trained multimedia information processing model may be deployed in the server, such as the server 200 in the aforementioned FIG. 1.
  • FIG. 2 is a schematic diagram of the composition and structure of an electronic device provided by an embodiment of the present application. It can be understood that FIG. 2 only shows an exemplary structure of the electronic device rather than the entire structure, and part or all of the structure shown in FIG. 2 can be implemented as needed.
  • The electronic device may include: at least one processor 201, a memory 202, a user interface 203, and at least one network interface 204.
  • the various components in electronic device 20 are coupled together by bus system 205 .
  • the bus system 205 is used to implement the connection communication between these components.
  • the bus system 205 also includes a power bus, a control bus and a status signal bus.
  • the various buses are labeled as bus system 205 in FIG. 2 .
  • the user interface 203 may include a display, a keyboard, a mouse, a trackball, a click wheel, keys, buttons, a touch pad or a touch screen, and the like.
  • the memory 202 may be either volatile memory or non-volatile memory, and may include both volatile and non-volatile memory.
  • the memory 202 in this embodiment of the present application can store data to support the operations of the terminals (eg, the terminal 10-1 and the terminal 10-2). Examples of such data include: any computer programs, such as operating systems and applications, used to operate on terminals such as terminal 10-1 and terminal 10-2.
  • the operating system includes various system programs, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks.
  • Applications can contain various applications.
  • the multimedia information processing apparatus provided by the embodiments of the present application may be implemented by a combination of software and hardware.
  • The multimedia information processing apparatus provided by the embodiments of the present application may be a processor in the form of a hardware decoding processor, which is programmed to execute the multimedia information processing method provided by the embodiments of the present application.
  • The processor in the form of a hardware decoding processor may adopt one or more Application Specific Integrated Circuits (ASIC), Digital Signal Processors (DSP), Programmable Logic Devices (PLD), Complex Programmable Logic Devices (CPLD), Field-Programmable Gate Arrays (FPGA), or other electronic components.
  • The multimedia information processing apparatus provided by the embodiment of the present application may be directly embodied as a combination of software modules executed by the processor 201. The software modules may be located in a storage medium, and the storage medium is located in the memory 202. The processor 201 reads the executable instructions included in the software modules in the memory 202 and, in combination with the necessary hardware (for example, the processor 201 and other components connected to the bus system 205), completes the multimedia information processing method provided by the embodiments of the present application.
  • The processor 201 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a DSP, or another programmable logic device, discrete gate or transistor logic device, or discrete hardware component, where the general-purpose processor may be a microprocessor or any conventional processor.
  • The apparatus provided by this embodiment of the present application may be directly executed by a processor 201 in the form of a hardware decoding processor, for example, by one or more ASICs, DSPs, PLDs, CPLDs, FPGAs, or other electronic components, to implement the multimedia information processing method provided by the embodiments of the present application.
  • the memory 202 in the embodiment of the present application is used to store various types of data to support the operation of the electronic device 20 .
  • Examples of such data include any executable instructions for operating on the electronic device 20; a program implementing the multimedia information processing method of the embodiment of the present application may be included in the executable instructions.
  • the multimedia information processing apparatus provided by the embodiments of the present application may be implemented in software.
  • FIG. 2 shows the multimedia information processing apparatus 2020 stored in the memory 202, which may be software in the form of programs and plug-ins and includes a series of modules.
  • The programs stored in the memory 202 may include the multimedia information processing apparatus 2020.
  • The multimedia information processing apparatus 2020 includes the following software modules: an information transmission module 2081 and an information processing module 2082.
  • When the software modules in the multimedia information processing apparatus 2020 are read into RAM by the processor 201 and executed, the multimedia information processing method provided by the embodiment of the present application is implemented.
  • An embodiment of the present application further provides a computer program product or computer program, where the computer program product or computer program includes computer instructions (executable instructions), and the computer instructions are stored in a computer-readable storage medium.
  • The processor of the electronic device reads the computer instructions from the computer-readable storage medium and executes them, so that the electronic device performs the different embodiments, and combinations of embodiments, provided in the various optional implementations of the above multimedia information processing method.
  • FIG. 3 is a schematic diagram of excessive video clipping provided by an embodiment of the present application.
  • More and more video pictures are based on the same fixed background, and only a small proportion of the picture is user-defined content. As a result, two pictures have high overall similarity even though their actual content differs and both are original content.
  • audio information in videos can be compared through an audio fingerprint algorithm to determine whether the videos are similar.
  • The audio is first Fourier transformed to generate an audio spectrogram; then, on the basis of the audio spectrogram, the corresponding constellation map is calculated from the frequency peak points; finally, the constellation map is processed to generate a combined hash landmark(t1, f1, f2, t2-t1), where t1 and t2 are two time points, f1 is the maximum frequency value at time point t1, and f2 is the maximum frequency value at time point t2.
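The landmark construction described here can be sketched in a few lines. The example below is a minimal illustration, not the method of any particular fingerprinting library: it assumes the magnitude spectrogram is already available as a list of frames, picks the strongest frequency bin per frame as the peak, and pairs each peak with the next few peaks. The function names and the `fan_out` parameter are hypothetical.

```python
from typing import List, Tuple

Peak = Tuple[int, int]                # (time frame, frequency bin)
Landmark = Tuple[int, int, int, int]  # (t1, f1, f2, t2 - t1)


def frame_peaks(spectrogram: List[List[float]]) -> List[Peak]:
    """Return (frame index, strongest frequency bin) for every frame."""
    return [(t, max(range(len(frame)), key=frame.__getitem__))
            for t, frame in enumerate(spectrogram)]


def landmarks(peaks: List[Peak], fan_out: int = 2) -> List[Landmark]:
    """Pair each peak with the next `fan_out` peaks to form combined hashes."""
    result = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:i + 1 + fan_out]:
            result.append((t1, f1, f2, t2 - t1))
    return result


# Toy magnitude spectrogram: 3 time frames, 4 frequency bins each.
spec = [
    [0.1, 0.9, 0.2, 0.0],
    [0.0, 0.2, 0.8, 0.1],
    [0.7, 0.1, 0.1, 0.3],
]
peaks = frame_peaks(spec)
marks = landmarks(peaks)
print(peaks, marks)
```

Each landmark ties together two frequencies and the time gap between them, so it depends on the peak frequencies and on the spacing between peaks.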
  • However, this process cannot achieve accurate identification in use environments with many attacks.
  • For example, in a voice-changing attack, the landmark depends on the frequency peaks, and the frequency of the audio is changed in the voice-changed video, so the generated hashes differ, which eventually causes similarity identification to fail (that is, the videos are wrongly identified as dissimilar). Similarly, in a speed-up or slow-down attack, the combined hash in the landmark depends on dt (t2-t1), and the dt of sped-up or slowed-down audio changes, so the generated hashes differ, which again causes similarity identification to fail.
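The two failure modes can be reproduced on toy data. The sketch below hashes consecutive peaks as (f1, f2, t2 - t1), then simulates a speed change (time stamps divided by a factor, pitch assumed unchanged) and a voice change (all peak frequencies shifted by a fixed number of bins). The peak values, the helper names, and the choice to leave the absolute time t1 out of the comparison key are illustrative assumptions.

```python
from typing import List, Set, Tuple

Peak = Tuple[int, int]  # (time frame, frequency bin)


def pair_hashes(peaks: List[Peak]) -> Set[Tuple[int, int, int]]:
    """Hash each pair of consecutive peaks as (f1, f2, t2 - t1)."""
    return {(f1, f2, t2 - t1) for (t1, f1), (t2, f2) in zip(peaks, peaks[1:])}


def change_speed(peaks: List[Peak], factor: float) -> List[Peak]:
    """Simulate a speed change: times shrink by `factor`, pitch assumed preserved."""
    return [(round(t / factor), f) for t, f in peaks]


def shift_pitch(peaks: List[Peak], bins: int) -> List[Peak]:
    """Simulate a voice change: every peak frequency moves by `bins`."""
    return [(t, f + bins) for t, f in peaks]


original = [(0, 10), (4, 14), (8, 9), (12, 20)]  # hypothetical peaks
reference = pair_hashes(original)

same = len(reference & pair_hashes(original))
fast = len(reference & pair_hashes(change_speed(original, 2.0)))
shifted = len(reference & pair_hashes(shift_pitch(original, 3)))
print(same, fast, shifted)
```

With these toy peaks, the unmodified audio shares all three hashes with the reference, while the sped-up and pitch-shifted versions share none, which matches the failure described above.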
  • FIG. 4 is a schematic flowchart of a multimedia information processing method provided by an embodiment of the present application. It is understood that the steps shown in FIG. 4 can be executed by various electronic devices running the multimedia information processing apparatus, for example a terminal, server or server cluster with a multimedia information processing function. For example, when the multimedia information processing apparatus runs in a terminal, it can trigger an applet in the terminal to perform similarity detection (similarity identification) of multimedia information; when the apparatus runs in a long-video copyright detection server or a music playing software server, it can perform copyright detection on the corresponding long video or audio information. The steps shown in FIG. 4 will be described below.
  • Step 401 The multimedia information processing apparatus parses the multimedia information to separate audio in the multimedia information.
  • the multimedia information is acquired, and the multimedia information is parsed to obtain the audio in the multimedia information.
  • the multimedia information is parsed to separate out the audio in the multimedia information, which can be implemented in the following ways:
  • An audio synchronization packet in the video may be acquired first; the audio synchronization packet is used to represent timing information. Then, the audio header decoding data AACDecoderSpecificInfo and the audio data configuration information AudioSpecificConfig in the audio synchronization packet are parsed to obtain the playback duration parameter and the audio track information parameter corresponding to the video.
  • The audio data configuration information AudioSpecificConfig is used to generate the ADTS header (including the sampling rate, the number of channels, and the frame length data of the audio). Based on the audio track information parameter, the other audio packets in the video are obtained and the original audio data is parsed.
  • The Advanced Audio Coding (AAC) elementary stream (ES) is then packaged with the audio data header so that an AAC decoder can process it.
  • A 7-byte ADTS header (ADTSheader) can be added before the AAC ES stream to extract the audio in multimedia information (such as video).
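  • As a rough illustration of the 7-byte header mentioned above, the following sketch packs an ADTS header (without CRC) for one raw AAC frame. The function names, defaults, and example payload length are assumptions for illustration; in practice the sampling rate, channel count and object type would come from AudioSpecificConfig.

```python
# Sketch: build the 7-byte ADTS header that precedes one raw AAC frame.
# Field layout follows the ADTS format without CRC; the function names and
# default values are illustrative assumptions.

SAMPLE_RATES = [96000, 88200, 64000, 48000, 44100, 32000,
                24000, 22050, 16000, 12000, 11025, 8000, 7350]

def adts_header(payload_len: int, sample_rate: int = 44100,
                channels: int = 2, audio_object_type: int = 2) -> bytes:
    freq_idx = SAMPLE_RATES.index(sample_rate)
    frame_len = payload_len + 7          # the length field counts the header too
    profile = audio_object_type - 1      # 2-bit profile = object type - 1
    return bytes([
        0xFF,                            # syncword, high 8 bits
        0xF1,                            # syncword low 4 bits, MPEG-4, layer 0, no CRC
        (profile << 6) | (freq_idx << 2) | (channels >> 2),
        ((channels & 0x3) << 6) | (frame_len >> 11),
        (frame_len >> 3) & 0xFF,
        ((frame_len & 0x7) << 5) | 0x1F,  # low 3 length bits + buffer fullness bits
        0xFC,                            # remaining fullness bits, one raw data block
    ])

def adts_frame_length(header: bytes) -> int:
    """Read the 13-bit frame length back out of an ADTS header."""
    return ((header[3] & 0x3) << 11) | (header[4] << 3) | (header[5] >> 5)

header = adts_header(payload_len=100)
print(header.hex(), adts_frame_length(header))
```

  • Writing `adts_header(len(frame)) + frame` for each raw frame yields a stream that a typical AAC decoder can read.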
  • Step 402 The multimedia information processing apparatus performs conversion processing on the audio to obtain a mel-spectrogram corresponding to the audio.
  • The audio is converted to a mel spectrogram. Frequency is measured in Hertz (Hz), and the human ear can hear frequencies in the range 20-20000 Hz, but the ear does not perceive the Hz scale linearly. For example, if a listener has adapted to a 1000 Hz tone and the pitch frequency is increased to 2000 Hz, the ear detects only that the frequency has increased a little and does not notice that it has doubled. If the audio is converted to a mel spectrogram (that is, the frequency scale is converted to the mel frequency scale), the ear's perception of frequency becomes a linear relationship: under the mel frequency scale, if the mel frequencies of two audio signals differ by a factor of two, the tones perceived by the human ear also differ by about a factor of two. This improves how well the representation matches the perception of the audio and achieves the beneficial technical effect of visualizing the audio.
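  • The Hz-to-mel relationship can be checked numerically. The text does not state which mel formula is used; the sketch below assumes the widely used form mel = 2595 * log10(1 + f/700), and the function names are illustrative.

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in Hz to mels (assumed 2595*log10(1 + f/700) form)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

for f in (1000.0, 2000.0):
    print(f, round(hz_to_mel(f), 1))
```

  • Under this formula, doubling the frequency from 1000 Hz to 2000 Hz raises the mel value by a factor of only about 1.5, which matches the observation that the ear does not hear the tone as having doubled.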
  • converting the audio to obtain a mel spectrogram corresponding to the audio can be implemented in the following ways:
  • The audio can first be resampled (that is, channel conversion processing) into 16 kHz mono audio data; a periodic Hann window with a 25 ms window length and a 10 ms frame shift can then be used to perform short-time processing on the mono audio data.
  • the windowing function and the length parameter can be regarded as corresponding to the multimedia information processing model.
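  • The framing parameters above (16 kHz mono, 25 ms periodic Hann window, 10 ms frame shift) can be sketched with NumPy as follows. The FFT length of 512 and the function names are assumptions; the sketch stops at the magnitude spectrogram and does not apply the mel filter bank.

```python
import numpy as np

SAMPLE_RATE = 16_000
WIN = SAMPLE_RATE * 25 // 1000   # 25 ms window -> 400 samples
HOP = SAMPLE_RATE * 10 // 1000   # 10 ms frame shift -> 160 samples
N_FFT = 512                      # assumed FFT length (next power of two above WIN)

def periodic_hann(n: int) -> np.ndarray:
    """Periodic Hann window: 0.5 - 0.5*cos(2*pi*k/n) for k = 0..n-1."""
    return 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(n) / n)

def magnitude_spectrogram(signal: np.ndarray) -> np.ndarray:
    """Slice the signal into overlapping frames, window them, and take |FFT|."""
    n_frames = 1 + (len(signal) - WIN) // HOP
    idx = np.arange(WIN)[None, :] + HOP * np.arange(n_frames)[:, None]
    frames = signal[idx] * periodic_hann(WIN)
    return np.abs(np.fft.rfft(frames, n=N_FFT, axis=1))

# One second of a 1000 Hz tone as a stand-in for resampled mono audio.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
spec = magnitude_spectrogram(np.sin(2.0 * np.pi * 1000.0 * t))
print(spec.shape)
```

  • Each row of `spec` is one 10 ms step; the rows would then be mapped onto mel bands to form the mel spectrogram.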
  • Step 403 The multimedia information processing apparatus determines an audio feature vector corresponding to the audio according to the mel spectrogram corresponding to the audio.
  • the audio feature vector corresponding to the audio is determined according to the mel spectrogram corresponding to the audio, and the audio feature vector can accurately and effectively reflect the characteristics of the audio.
  • determining the audio feature vector corresponding to the audio according to the mel spectrogram corresponding to the audio can be implemented in the following ways:
  • The corresponding input triplet samples are determined based on the mel spectrogram; the input triplet samples are processed alternately through the convolution layers and the max pooling layers of the multimedia information processing model to obtain downsampling results for the different input triplet samples; the downsampling results are normalized through the fully connected layer of the multimedia information processing model to obtain normalized results; and deep decomposition processing is performed on the normalized results through the multimedia information processing model to obtain audio feature vectors matched with the different input triplet samples.
  • FIG. 5 is a schematic diagram of the processing performed by the multimedia information processing model in an embodiment of the present application. Feature extraction in the multimedia information processing model may be implemented through a VGGish (Visual Geometry Group) network. For example, the audio features are extracted from the mel spectrogram through the VGGish network, and the extracted vectors are clustered and encoded by NetVLAD (Net Vector of Locally Aggregated Descriptors) to obtain the audio feature vector.
  • NetVLAD can save the distance between each feature point and its nearest cluster center and use it as a new feature.
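  • The idea of recording how each feature point relates to its nearest cluster center can be sketched with a plain VLAD encoder. This is a simplification under stated assumptions: NetVLAD learns the cluster centers and uses soft assignment, whereas the sketch below uses fixed centers and hard nearest-center assignment, and `vlad_encode` is a hypothetical name.

```python
import math

def vlad_encode(descriptors, centers):
    """Sum residuals (x - nearest center) per cluster, then L2-normalize."""
    k, d = len(centers), len(centers[0])
    acc = [[0.0] * d for _ in range(k)]
    for x in descriptors:
        # Index of the nearest center by squared Euclidean distance.
        j = min(range(k),
                key=lambda c: sum((xi - ci) ** 2 for xi, ci in zip(x, centers[c])))
        for i in range(d):
            acc[j][i] += x[i] - centers[j][i]
    flat = [v for row in acc for v in row]
    norm = math.sqrt(sum(v * v for v in flat)) or 1.0
    return [v / norm for v in flat]

# Three 2-D descriptors and two cluster centers (toy values).
encoding = vlad_encode([[1.0, 0.0], [0.9, 0.1], [0.0, 1.2]],
                       [[1.0, 0.0], [0.0, 1.0]])
print(len(encoding))
```

  • The output has one residual vector per cluster, concatenated, so its length is the number of clusters times the descriptor dimension.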
  • the VGGish network supports the extraction of 128-dimensional embedding feature vectors with semantics from the corresponding audio, that is, audio feature vectors.
  • The audio is first converted into input triplet samples of the mel spectrogram, which serve as the input of the VGGish network.
  • An example of the conversion process is as follows: the audio spectrogram is calculated from the signal amplitude, the spectrogram is mapped onto a 64-band mel filter bank to calculate the mel spectrogram, and N input triplet samples mapped from Hz to the mel scale are obtained, with feature dimensions N*96*64.
  • The TensorFlow-based VGGish network can be used as an audio feature extractor: the input triplet samples are used as the input of the VGGish network, and the VGGish network performs feature extraction to obtain N*128 audio feature vectors.
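  • The N*96*64 shape can be reproduced by cutting a mel spectrogram into fixed-length examples. The sketch assumes non-overlapping examples of 96 frames (0.96 s at a 10 ms frame shift) and uses nested lists in place of a real mel spectrogram; `frame_examples` is a hypothetical helper.

```python
FRAMES_PER_EXAMPLE = 96   # 96 frames x 10 ms frame shift = 0.96 s
MEL_BANDS = 64

def frame_examples(mel_frames):
    """Split a (T x 64) mel spectrogram into N non-overlapping (96 x 64) examples."""
    n = len(mel_frames) // FRAMES_PER_EXAMPLE
    return [mel_frames[i * FRAMES_PER_EXAMPLE:(i + 1) * FRAMES_PER_EXAMPLE]
            for i in range(n)]

# Placeholder spectrogram: 300 frames (3 s of audio), 64 mel bands each.
mel = [[0.0] * MEL_BANDS for _ in range(300)]
examples = frame_examples(mel)
print(len(examples), len(examples[0]), len(examples[0][0]))
```

  • Each example would then be passed through the feature extractor to give one 128-dimensional vector, so N examples yield N*128 features.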
  • Step 404 The multimedia information processing apparatus determines the similarity between the target multimedia information and the source multimedia information based on the audio feature vector corresponding to the source audio in the source multimedia information and the audio feature vector corresponding to the target audio in the target multimedia information.
  • The similarity between two pieces of multimedia information can be determined according to their audio feature vectors.
  • For ease of distinction, the two are named source multimedia information and target multimedia information, the audio in the source multimedia information is named source audio, and the audio in the target multimedia information is named target audio. Through steps 401 to 403, the audio feature vector corresponding to the source audio and the audio feature vector corresponding to the target audio can then be determined.
  • Based on these two audio feature vectors, the similarity between the target multimedia information and the source multimedia information can be determined, that is, similarity identification is achieved.
  • FIG. 6 is a schematic flowchart of similarity identification provided by an embodiment of the present application. It is understood that the steps shown in FIG. 6 can be executed by various electronic devices running the multimedia information processing apparatus, such as a terminal, server or server cluster with a multimedia information processing function.
  • For example, when the multimedia information processing apparatus runs in a terminal, it can trigger an applet in the terminal to perform similarity identification of multimedia information; when the apparatus runs in a short-video copyright detection server or a music playback software server, it can perform copyright detection on the corresponding short video or audio.
  • the steps shown in FIG. 6 will be described below.
  • Step 601 Determine the corresponding interframe similarity parameter set based on the audio feature vector corresponding to the source audio in the source multimedia information and the audio feature vector corresponding to the target audio in the target multimedia information.
  • The source audio may be divided into multiple audio frames, and the target audio may likewise be divided into multiple audio frames; the source audio and the target audio may use the same division standard (e.g., the duration of each audio frame).
  • Pairwise combination processing (such as exhaustive pairwise combination processing) is performed on the audio frames to obtain audio frame pairs, where each audio frame pair includes one audio frame divided from the source audio and one audio frame divided from the target audio.
  • For each audio frame pair, the inter-frame similarity between the two audio frames is determined according to the audio feature vectors corresponding to the two audio frames. An inter-frame similarity parameter set is then constructed from all the inter-frame similarities.
  • Step 602 Determine the number of audio frames that reach the similarity threshold in the inter-frame similarity parameter set.
  • the number of audio frames whose inter-frame similarity reaches a similarity threshold is determined (the number of audio frames here may refer to the number of pairs of audio frames).
  • Step 603 when the number of audio frames reaching the similarity threshold exceeds the number threshold, perform step 604 , otherwise perform step 605 .
  • The number of audio frames that reach the similarity threshold is compared with the number threshold.
  • When the number of audio frames that reach the similarity threshold exceeds the number threshold, it is determined that the target multimedia information is similar to the source multimedia information; when the number of audio frames that reach the similarity threshold does not exceed the number threshold, it is determined that the target multimedia information is not similar to the source multimedia information.
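  • Steps 601 to 603 can be sketched as follows. The text does not fix the similarity measure or the threshold values, so cosine similarity, the thresholds 0.9 and 2, and the function names are all assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def is_similar(source_vecs, target_vecs, sim_threshold=0.9, count_threshold=2):
    """Exhaustively pair frames, count pairs reaching the similarity threshold,
    and compare that count with the number threshold."""
    sims = [cosine(s, t) for s in source_vecs for t in target_vecs]  # step 601
    hits = sum(1 for s in sims if s >= sim_threshold)                # step 602
    return hits > count_threshold, hits                              # step 603

source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
same_audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
other_audio = [[-1.0, 0.0], [0.0, -1.0]]

print(is_similar(source, same_audio), is_similar(source, other_audio))
```

  • With real data the per-frame vectors would be the 128-dimensional audio feature vectors, and both thresholds would need tuning.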
  • Step 604 Determine that the target multimedia information is similar to the source multimedia information, and prompt to provide copyright information.
  • When the target multimedia information is similar to the source multimedia information, there may be a risk of copyright infringement. A prompt to provide copyright information can therefore be issued; the copyright information requested may be that of at least one of the target multimedia information and the source multimedia information.
  • Step 605 It is determined that the target multimedia information is not similar to the source multimedia information, and a corresponding recommendation process is entered.
  • The recommendation process can be used to recommend at least one of the target multimedia information and the source multimedia information.
  • When it is determined that the target multimedia information is similar to the source multimedia information, the copyright information of the target multimedia information and the copyright information of the source multimedia information are obtained; the legitimacy of the target multimedia information is determined from the copyright information of the target multimedia information and the copyright information of the source multimedia information; and when the two are inconsistent, warning information is issued.
  • That is, the legality of the target multimedia information can be determined by obtaining and comparing the copyright information of the target multimedia information and the copyright information of the source multimedia information.
  • Taking the case where the source multimedia information is legal by default as an example: when the copyright information of the target multimedia information is consistent with the copyright information of the source multimedia information, the target multimedia information is determined to be legal; when the copyright information of the target multimedia information is inconsistent with the copyright information of the source multimedia information, the target multimedia information is determined to be illegal.
  • When the two are inconsistent, warning information may also be issued.
  • The legality of the source multimedia information may also be determined on the premise that the target multimedia information is legal by default.
  • When it is determined that the target multimedia information is not similar to the source multimedia information, the target multimedia information is added to the multimedia information source; the recall order of the multimedia information to be recommended in the multimedia information source is sorted; and multimedia information is recommended to the target user based on the sorting result of the recall order.
  • the target multimedia information can be added to the multimedia information source as the multimedia information to be recommended in the multimedia information source.
  • Source multimedia information can also be added to the source of multimedia information.
  • FIG. 7 is a schematic flowchart of training a multimedia information processing model provided by an embodiment of the present application. It is understood that the steps shown in FIG. 7 can be executed by various electronic devices running the multimedia information processing apparatus, such as a terminal, server or server cluster with a multimedia information processing function. Before the multimedia information processing model is deployed, it may be trained, as described with reference to the steps shown in FIG. 7.
  • Step 701 Obtain a first training sample set, where the first training sample set includes audio samples in the collected multimedia information.
  • a first training sample set is obtained, where the first training sample set includes audio samples in video information collected (eg, collected by a terminal), and the first training sample set may include at least one audio sample.
  • Step 702 Add noise to the first training sample set to obtain a corresponding second training sample set.
  • noise addition is performed on the first training sample set to obtain a corresponding second training sample set, which can be implemented in the following ways:
  • Audio information attacks include but are not limited to attacks that change the audio frequency and attacks that change the speed of the video. Therefore, when constructing the second training sample set, an audio enhancement data set can be made according to these audio attack types, where the audio enhancement forms (i.e., dynamic noise types) include but are not limited to: voice change, added background noise, volume change, sample rate change, and sound quality change. Different enhanced audio can be obtained by setting different audio enhancement forms. It should be noted that, in some embodiments of the present application, the construction of the second training sample set does not cover cases where the video duration changes or where frames are misaligned due to frame shift.
  • a second training sample set is made according to the audio enhancement data set, for example, one original audio corresponds to 20 attack audios in the audio enhancement data set, where each attack audio and the original audio have the same duration and no frame shift (that is, the corresponding time points of the audio are the same ), the audio duration is dur, and with 0.96s as the step, each set of audio (original audio + corresponding attack audio) will generate dur/0.96 tags, and the tags at the same time point are the same. From the attack audio and the corresponding labels, a second set of training samples can be constructed.
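  • The labeling scheme above can be sketched as follows: every 0.96 s segment of a group (original audio plus its attack audio) receives a label, and segments at the same time point share the same label. The label format, the variant names, and the use of integer milliseconds are assumptions for illustration.

```python
STEP_MS = 960  # 0.96 s labeling step, in milliseconds to avoid float rounding

def segment_labels(audio_id: str, duration_ms: int) -> list:
    """One label per full 0.96 s segment (dur / 0.96 labels in total)."""
    return [f"{audio_id}#{i}" for i in range(duration_ms // STEP_MS)]

def build_samples(audio_id: str, duration_ms: int, n_attacks: int = 20) -> list:
    """Return (variant, segment_index, label) for the original and each attack."""
    labels = segment_labels(audio_id, duration_ms)
    variants = ["original"] + [f"attack_{k}" for k in range(n_attacks)]
    return [(v, i, label) for v in variants for i, label in enumerate(labels)]

samples = build_samples("clip42", duration_ms=9600)
print(len(samples), samples[0], samples[-1])
```

  • Because the attack audio has the same duration and no frame shift, segment i of every variant maps to the same label, which is what lets same-label segments serve as similar pairs during training.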
  • Step 703 Process the second training sample set by the multimedia information processing model to determine the initial parameters of the multimedia information processing model.
  • Step 704 In response to the initial parameters of the multimedia information processing model, the second training sample set is processed by the multimedia information processing model to determine the update parameters of the multimedia information processing model.
  • the second training sample set in response to the initial parameters of the multimedia information processing model, is processed by the multimedia information processing model to determine the update parameters of the multimedia information processing model, which can be achieved in the following ways:
  • Step 705 Iteratively update the network parameters of the multimedia information processing model through the second training sample set according to the update parameters of the multimedia information processing model.
  • The convergence condition matching the triplet loss layer network in the multimedia information processing model can be determined, and the network parameters of the triplet loss layer network are iteratively updated until the loss function corresponding to the triplet loss layer network satisfies the corresponding convergence condition.
  • Taking a multimedia information processing model that includes the VGGish network as an example, in the training phase the 128-dimensional vector obtained by the VGGish network is input into the triplet loss layer (triplet-loss layer) of the multimedia information processing model for training, so that similar audio produces similar embedding outputs (i.e., audio feature vectors).
  • Triplet loss refers to formula 1:
  • Here L represents the triplet loss function; a is a sample (the anchor); p represents a sample similar to a; n represents a sample that belongs to a different category from a (i.e., is not similar to a); d(a, p) is the distance between a and p in the vector space; and d(a, n) is, likewise, the distance between a and n in the vector space.
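  • Formula 1 itself is not reproduced in this text. A commonly used form of the triplet loss that is consistent with the symbols L, a, p, n and d defined here is given below as an assumption; the margin term is a standard ingredient of that form and is not named in the surrounding text.

```latex
L = \max\bigl( d(a, p) - d(a, n) + \mathrm{margin},\; 0 \bigr)
```

  • Under this form the loss is zero once n is farther from a than p is by at least the margin.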
  • FIG. 8 is a schematic diagram of the effect of the iterative processing in the implementation of the application.
  • The final optimization goal of the iterative processing shown in FIG. 8 is to shorten the distance between a and p and to widen the distance between a and n.
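  • The effect of this optimization goal can be checked with a small numeric sketch. It assumes Euclidean distance and a margin-based triplet loss, max(d(a, p) - d(a, n) + margin, 0); the margin value 0.2 and the function names are illustrative assumptions.

```python
import math

def distance(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(a, p, n, margin=0.2):
    """Margin-based triplet loss for anchor a, similar sample p, dissimilar sample n."""
    return max(distance(a, p) - distance(a, n) + margin, 0.0)

anchor = [0.0, 0.0]
positive = [0.1, 0.0]

# Dissimilar sample still close to the anchor: the loss is positive.
loss_before = triplet_loss(anchor, positive, [0.2, 0.0])
# After training pushes the dissimilar sample away: the loss drops to zero.
loss_after = triplet_loss(anchor, positive, [1.0, 0.0])

print(loss_before, loss_after)
```

  • Minimizing this quantity over many triplets pulls similar audio embeddings together and pushes dissimilar ones apart.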
  • the distance can include the following three situations:
  • the related information of multimedia information can be stored in the blockchain network or cloud server, so as to achieve accurate judgment of the similarity of multimedia information.
  • The identifier of the multimedia information, the audio feature vector corresponding to the audio in the multimedia information, and the copyright information of the multimedia information may also be sent to the blockchain network, so that the nodes of the blockchain network fill the identifier, the audio feature vector and the copyright information into a new block and, when consensus on the new block is reached, append the new block to the tail of the blockchain.
  • The method also includes:
  • Receiving a data synchronization request from another node in the blockchain network; verifying the permissions of the other node in response to the data synchronization request; and, when the permissions of the other node pass verification, controlling data synchronization between the current node and the other node so that the other node obtains the identifier of the multimedia information, the audio feature vector corresponding to the audio in the multimedia information, and the copyright information of the multimedia information.
  • The method further includes: in response to a query request, parsing the query request to obtain the corresponding object identifier; acquiring permission information in the target block in the blockchain network according to the object identifier; checking whether the permission information matches the object identifier; when the permission information matches the object identifier, obtaining from the blockchain network the identifier of the corresponding multimedia information, the audio feature vector corresponding to the audio in the multimedia information, and the copyright information of the multimedia information; and sending the obtained identifier, audio feature vector and copyright information to the corresponding client, so that the client obtains them.
  • FIG. 9 is a schematic diagram of the architecture of a blockchain network provided by an embodiment of the present application, including a blockchain network 200 (including a plurality of consensus nodes; a consensus node 210 is exemplarily shown in FIG. 9), a certification center 300, a business entity 400 and a business entity 500, which will be described separately below.
  • a blockchain network 200 including a plurality of consensus nodes, and a consensus node 210 is exemplarily shown in FIG. 9
  • a certification center 300 the business entity 400 and the business entity 500 will be described separately below.
  • the type of the blockchain network 200 is flexible and diverse, for example, it can be any one of a public chain, a private chain or a consortium chain.
  • Taking the public chain as an example, the electronic equipment of any business entity, such as user terminals and servers, can access the blockchain network 200 without authorization. Taking the consortium chain as an example, after a business entity obtains authorization, the electronic devices under its jurisdiction (for example, a terminal/server) can access the blockchain network 200 and thereby become client nodes in the blockchain network 200.
  • A client node may serve only as an observer of the blockchain network 200, that is, provide the function of supporting the business entity in initiating transactions (for example, for storing data on the chain or querying data on the chain), while the functions of the consensus node 210 of the blockchain network 200 can be implemented by the client node by default or selectively (e.g., depending on the specific business needs of the business entity). In this way, the data and business processing logic of the business entity can be migrated to the blockchain network 200 to the greatest extent, and the trustworthiness and traceability of the data and business processing can be realized through the blockchain network 200.
  • The consensus node 210 in the blockchain network 200 receives transactions submitted by the client nodes of different business entities (for example, the client node 410 belonging to the business entity 400 and the client node 510 belonging to the business entity 500), executes the transactions to update the ledger or query the ledger, and various intermediate results or final results of the executed transactions can be returned to the client nodes of the business entities for display.
  • client nodes 410/510 can subscribe to events of interest in the blockchain network 200, such as transactions occurring in a specific organization/channel in the blockchain network 200, and the consensus node 210 pushes corresponding transaction notifications to the client node 410/510, thereby triggering the corresponding business logic in the client node 410/510.
  • A plurality of business entities involved in the management process (for example, the business entity 400 may be a multimedia information processing apparatus, and the business entity 500 may be a display system with a multimedia information processing function) register with the certification center 300 to obtain their respective digital certificates. A digital certificate includes the public key of the business entity and the digital signature issued by the certification center 300 over the public key and identity information of the business entity. The digital certificate is attached to a transaction together with the business entity's digital signature for that transaction and is sent to the blockchain network, so that the blockchain network can extract the digital certificate and signature from the transaction, verify the reliability of the message (that is, that it has not been tampered with) and the identity information of the business entity sending the message, and perform validation based on that identity, such as checking whether the entity has permission to initiate transactions.
  • Clients running on electronic devices (such as terminals or servers) under the jurisdiction of the business entity can request access to the blockchain network 200 to become client nodes.
  • The client node 410 of the business entity 400 is configured to send the identifier of the multimedia information, the audio feature vector corresponding to the audio in the multimedia information, and the copyright information of the multimedia information to the blockchain network, so that the nodes of the blockchain network fill the identifier, the audio feature vector and the copyright information into a new block and, when consensus on the new block is reached, append the new block to the end of the blockchain.
  • To send the identifier of the corresponding multimedia information, the audio feature vector corresponding to the audio in the multimedia information, and the copyright information of the multimedia information to the blockchain network 200, business logic can be set in the client node 410 in advance. For example, when it is determined that the target multimedia information is not similar to the source multimedia information, the client node 410 automatically sends the identifier of the target multimedia information, the audio feature vector corresponding to the audio in the target multimedia information, and the copyright information of the target multimedia information to the blockchain network 200. Alternatively, business personnel of the business entity 400 can log in to the client node 410, manually package the identifier of the target multimedia information, the audio feature vector corresponding to the audio in the target multimedia information, and the copyright information of the target multimedia information, and send the package to the blockchain network 200.
  • When sending, the client node 410 generates a transaction corresponding to the update operation according to the identifier of the multimedia information, the audio feature vector corresponding to the audio in the multimedia information, and the copyright information of the multimedia information, and specifies in the transaction the smart contract that needs to be called to realize the update operation and the parameters passed to the smart contract. The transaction also carries the digital certificate of the client node 410 and the signed digital signature (for example, obtained by encrypting the digest of the transaction using the private key in the digital certificate of the client node 410). The transaction is then broadcast to the consensus nodes 210 in the blockchain network 200.
  • When the consensus node 210 in the blockchain network 200 receives the transaction, it verifies the digital certificate and digital signature carried by the transaction. After this verification succeeds, it confirms, according to the identity of the business entity 400 carried in the transaction, whether the business entity 400 has transaction authority. Failure of either the digital signature verification or the authority verification causes the transaction to fail. After the verification succeeds, the consensus node 210 attaches its own digital signature (for example, obtained by encrypting the digest of the transaction with the private key of the consensus node 210) and continues to broadcast the transaction in the blockchain network 200.
  • After receiving the successfully verified transaction, the consensus node 210 in the blockchain network 200 fills the transaction into a new block and broadcasts the new block. When a consensus node 210 in the blockchain network 200 receives the broadcast new block, it performs a consensus process on the new block. If consensus succeeds, it appends the new block to the end of the blockchain stored by itself, updates the state database according to the result of the transaction, and executes the transaction in the new block: for a transaction that submits the identifier of the multimedia information to be processed, the audio feature vector corresponding to the audio in the multimedia information, and the copyright information of the multimedia information, a key-value pair including the identifier, the audio feature vector and the copyright information is added to the state database.
  • Business personnel of the business entity 500 log in to the client node 510 and input a query request for the identifier of the multimedia information, the audio feature vector corresponding to the audio in the multimedia information, and the copyright information of the multimedia information. The client node 510 generates a transaction corresponding to the update operation/query operation according to the query request, and specifies in the transaction the smart contract to be called to realize the update operation/query operation and the parameters passed to the smart contract. The transaction also carries the digital certificate of the client node 510 and the signed digital signature (for example, obtained by encrypting the digest of the transaction using the private key in the digital certificate of the client node 510). The transaction is then broadcast to the consensus node 210 in the blockchain network 200.
  • After the consensus node 210 in the blockchain network 200 receives the transaction, it verifies the transaction, fills it into a block and reaches consensus, appends the filled new block to the end of the blockchain stored by itself, updates the state database according to the result of the transaction, and executes the transaction in the new block: for a submitted transaction that updates the copyright information of certain multimedia information, the key-value pair corresponding to the copyright information of that multimedia information is updated in the state database; for a submitted transaction that queries the copyright information of certain multimedia information, the key-value pairs corresponding to the identifier of the multimedia information, the audio feature vector corresponding to the audio in the multimedia information, and the copyright information of the multimedia information are queried from the state database, and the transaction result is returned.
  • FIG. 9 exemplarily shows the process of directly uploading the identification of multimedia information, the audio feature vector corresponding to the audio in the multimedia information, and the copyright information of the multimedia information, but in other embodiments , for a situation where the identifier of the multimedia information, the audio feature vector corresponding to the audio in the multimedia information, and the copyright information of the multimedia information occupy a large amount of data, the client node 410 may associate the identifier of the multimedia information, the audio in the multimedia information corresponding to The audio feature vector of the multimedia information and the hash of the copyright information of the multimedia information are paired up on the chain, and the identification of the multimedia information, the audio feature vector corresponding to the audio in the multimedia information and the copyright information of the multimedia information are stored in the distributed file system or database.
  • after the client node 510 obtains the identifier of the multimedia information, the audio feature vector corresponding to the audio in the multimedia information, and the copyright information of the multimedia information from the distributed file system or database, it can verify them against the corresponding hashes in the blockchain network 200, which reduces the workload of on-chain operations.
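The off-chain storage with on-chain hash verification described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the SHA-256 digest, the canonical JSON encoding, and the two dictionaries standing in for the state database and the distributed file system are all assumptions.

```python
import hashlib
import json

on_chain = {}   # hypothetical stand-in for the state database: media id -> digest
off_chain = {}  # hypothetical stand-in for the distributed file system or database


def record_digest(media_id, feature_vector, copyright_info):
    """SHA-256 digest over a canonical JSON encoding of the record (assumed scheme)."""
    payload = json.dumps(
        {"id": media_id, "feature": list(feature_vector), "copyright": copyright_info},
        sort_keys=True,
        separators=(",", ":"),
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def store(media_id, feature_vector, copyright_info):
    """Keep the bulky record off-chain and only its digest on-chain."""
    off_chain[media_id] = (list(feature_vector), copyright_info)
    on_chain[media_id] = record_digest(media_id, feature_vector, copyright_info)


def fetch_verified(media_id):
    """Read the off-chain record and check it against the on-chain digest."""
    feature_vector, copyright_info = off_chain[media_id]
    if record_digest(media_id, feature_vector, copyright_info) != on_chain[media_id]:
        raise ValueError("off-chain record does not match the on-chain hash")
    return feature_vector, copyright_info
```

Only the fixed-size digest has to pass through consensus, so the cost of an on-chain write does not grow with the size of the feature vector.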
  • FIG. 10 is a schematic structural diagram of a blockchain in a blockchain network 200 provided by an embodiment of the present application.
  • the header of each block may include the hash value of all transactions in the block, and also the hash value of all transactions in the previous block.
  • records of newly generated transactions are filled into a block and, after the nodes in the blockchain network reach consensus, the block is appended to the end of the blockchain, so that the chain grows. The hash-based chain structure between blocks makes the transactions in the blocks tamper-proof and forgery-proof.
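The block structure described above, where each header carries the hash of its own transactions and the hash of the previous block's transactions, can be sketched as follows. The field names, the SHA-256 digest, and the JSON serialization are illustrative assumptions, and consensus is omitted entirely.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed placeholder for the block before the first block


def tx_hash(transactions):
    """Hash of all transactions in a block (assumed: SHA-256 over sorted-key JSON)."""
    data = json.dumps(transactions, sort_keys=True).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def append_block(chain, transactions):
    """Fill a new block and append it to the end of the chain."""
    prev = chain[-1]["header"]["tx_hash"] if chain else GENESIS_HASH
    block = {
        "header": {"tx_hash": tx_hash(transactions), "prev_tx_hash": prev},
        "transactions": transactions,
    }
    chain.append(block)
    return block


def verify_chain(chain):
    """Return False if any block's transactions or back-link has been altered."""
    prev = GENESIS_HASH
    for block in chain:
        header = block["header"]
        if header["prev_tx_hash"] != prev:
            return False
        if header["tx_hash"] != tx_hash(block["transactions"]):
            return False
        prev = header["tx_hash"]
    return True
```

Changing a transaction in an earlier block changes its hash, which no longer matches the value stored in that block's header or in the next block's header, so the tampering is detectable.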
  • FIG. 11 is a schematic diagram of the functional architecture of the blockchain network 200 provided by an embodiment of the present application, which includes an application layer 201, a consensus layer 202, a network layer 203, a data layer 204, and a resource layer 205, each described below.
  • the resource layer 205 encapsulates the computing resources, storage resources and communication resources for realizing each consensus node 210 in the blockchain network 200 .
  • the data layer 204 encapsulates various data structures that implement the ledger, including blockchains implemented as files in a file system, key-value state databases, and proofs of existence (eg, hash trees of transactions in blocks).
  • the network layer 203 encapsulates the point-to-point (P2P, Point to Point) network protocol, the data dissemination mechanism and data verification mechanism, the access authentication mechanism, and business entity identity management.
  • the P2P network protocol implements communication between the consensus nodes 210 in the blockchain network 200; the data dissemination mechanism ensures that transactions propagate through the blockchain network 200; the data verification mechanism uses cryptographic methods (such as digital certificates, digital signatures, and public/private key pairs) to make data transmission between consensus nodes 210 reliable; the access authentication mechanism authenticates the identity of a business entity joining the blockchain network 200 according to the actual business scenario and, when authentication passes, grants the business entity permission to access the blockchain network 200; business entity identity management stores the identities and permissions (for example, the types of transactions that can be initiated) of the business entities allowed to access the blockchain network 200.
  • the consensus layer 202 encapsulates a mechanism (ie, a consensus mechanism) for consensus nodes 210 in the blockchain network 200 to reach consensus on blocks, and functions of transaction management and ledger management.
  • the consensus mechanism includes consensus algorithms such as POS, POW, and DPOS, and supports the pluggability of consensus algorithms.
  • transaction management verifies the digital signature carried in a transaction received by the consensus node 210, verifies the identity information of the business entity, and determines from that identity information (read from business entity identity management) whether the entity is authorized to conduct the transaction. Every business entity authorized to access the blockchain network 200 holds a digital certificate issued by the certification center and signs the transactions it submits with the private key of its digital certificate, thereby declaring its legal identity.
  • Ledger management is used to maintain the blockchain and state database.
  • for a block on which consensus has been reached, ledger management appends the block to the end of the blockchain and executes the transactions in it: when a transaction includes an update operation, the key-value pair in the state database is updated; when a transaction includes a query operation, the state database is queried and the query result is returned to the client node of the business entity.
  • query operations on the state database are supported along several dimensions, including: querying a block by block vector number (such as a transaction hash value); querying a block by block hash value; querying a block by transaction vector number; querying a transaction by transaction vector number; querying the account data of a business entity by the account number (vector number) of the business entity; and querying the blockchain in a channel by channel name.
  • the application layer 201 encapsulates various services that the blockchain network can implement, including transaction traceability, certificate storage, and verification.
  • the copyright information of multimedia information that has undergone similarity identification can be stored in the blockchain network, and the multimedia information server can call the copyright information in the blockchain network to verify the copyright compliance of multimedia information uploaded by users.
  • FIG. 12 is a schematic diagram of a usage scenario of the multimedia information processing method provided by an embodiment of the present application, taking the case where the multimedia information is a short video as an example. Terminals (such as the terminal 10-1 and the terminal 10-2 shown in FIG. 1) are provided with a software client capable of displaying short videos, such as a client or plug-in for short video playback, through which the user can obtain and display short videos. The terminal connects to the short video server 200 through the network 300, which may be a wide area network, a local area network, or a combination of the two, and uses wireless links for data transmission.
  • users can also upload short videos through an applet in the terminal for other users in the network to watch. The operator's video server needs to detect the short videos uploaded by users and to compare and analyze different video information, for example to determine whether the copyright of an uploaded short video is compliant, and to recommend compliant videos to different users so that users' short videos are not pirated.
  • Step 1301 Acquire the audio in the video.
  • the acquired video can be parsed to separate out the audio in the video, and the audio can also be preprocessed through a preprocessing process, such as determining a mel spectrogram corresponding to the audio.
  • Step 1302 Obtain a training sample set of the video information processing model (corresponding to the multimedia information processing model above).
  • Step 1303 Train the video information processing model, and determine the corresponding model parameters (network parameters).
  • the video information processing model is trained according to the training sample set (such as the second training sample set above), and corresponding model parameters are determined.
  • Step 1304 Deploy the trained video information processing model in the corresponding video detection server.
  • correlation detection can be performed by the trained video information processing model.
  • Step 1305 Detect audio in different videos by using a video information processing model to determine whether different videos are similar.
  • taking the case where the video is a short video as an example, the copyright information of the target short video is obtained, for example the copyright information uploaded by the user through the applet run by the terminal 10-1, or copyright information read from its storage location in the cloud server network.
  • the legality of the target short video is determined from the copyright information of the target short video and the copyright information of the source video.
  • the target short video is added to the video source (corresponding to the multimedia information source above) as a video to be recommended. The recall order of all videos to be recommended in the video source is sorted, and videos are recommended to the target user based on the sorting result, which favors the promotion of original videos.
  • the software modules stored in the multimedia information processing apparatus 2020 of the memory 202 may include: an information transmission module 2081, configured to parse the multimedia information to separate the audio in the multimedia information; and an information processing module 2082, configured to convert the audio to obtain a mel spectrogram corresponding to the audio, determine the audio feature vector corresponding to the audio according to the mel spectrogram, and determine the similarity between the target multimedia information and the source multimedia information based on the audio feature vector corresponding to the source audio in the source multimedia information and the audio feature vector corresponding to the target audio in the target multimedia information.
  • the information transmission module 2081 is further configured to: parse the multimedia information to obtain the timing information of the multimedia information; parse the video parameters corresponding to the multimedia information according to the timing information to obtain the playback duration parameter and the audio track information parameter corresponding to the multimedia information; and extract the audio from the multimedia information based on the playback duration parameter and the audio track information parameter.
  • the information processing module 2082 is further configured to: perform channel conversion processing on the audio to obtain mono audio data; perform a short-time Fourier transform on the mono audio data based on a window function to obtain the corresponding spectrogram; and process the spectrogram according to the duration parameter to obtain the mel spectrogram corresponding to the audio.
  • the information processing module 2082 is further configured to: determine the corresponding input triplet samples based on the mel spectrogram; process the input triplet samples alternately through the convolution layers and max pooling layers of the multimedia information processing model to obtain down-sampling results of the different input triplet samples; normalize the down-sampling results through the fully connected layer of the multimedia information processing model to obtain normalized results; and perform depth decomposition processing on the normalized results through the multimedia information processing model to obtain audio feature vectors that match the different input triplet samples.
  • the information processing module 2082 is further configured to: obtain a first training sample set, where the first training sample set includes audio samples in the collected video information; add noise to the first training sample set to obtain a corresponding second training sample set; process the second training sample set through the multimedia information processing model to determine the initial parameters of the multimedia information processing model; in response to the initial parameters, process the second training sample set through the multimedia information processing model to determine the update parameters of the multimedia information processing model; and, according to the update parameters, iteratively update the network parameters of the multimedia information processing model through the second training sample set.
  • the information processing module 2082 is further configured to: determine a dynamic noise type that matches the use environment of the multimedia information processing model; and add noise to the first training sample set according to the dynamic noise type so as to change at least one of the background noise, volume, sampling rate, and sound quality of the audio samples in the first training sample set, obtaining the corresponding second training sample set.
  • the information processing module 2082 is further configured to: substitute different audio samples in the second training sample set into the loss function corresponding to the triplet loss function layer network of the multimedia information processing model; determine the parameters of the triplet loss function layer network at which the loss function satisfies the corresponding convergence condition; and use the parameters of the triplet loss function layer network as the update parameters of the multimedia information processing model.
  • the information processing module 2082 is further configured to: determine a convergence condition matching the triplet loss function layer network in the multimedia information processing model; and iteratively update the network parameters of the triplet loss function layer network until the loss function corresponding to the triplet loss function layer network satisfies the convergence condition.
  • the information processing module 2082 is further configured to: determine the corresponding inter-frame similarity parameter set based on the audio feature vector corresponding to the source audio in the source multimedia information and the audio feature vector corresponding to the target audio in the target multimedia information; determine the number of audio frames in the inter-frame similarity parameter set that reach the similarity threshold; and determine the similarity between the target multimedia information and the source multimedia information based on that number.
  • the information processing module 2082 is further configured to: when it is determined that the target multimedia information is similar to the source multimedia information, obtain the copyright information of the target multimedia information and the copyright information of the source multimedia information; determine the legality of the target multimedia information from the copyright information of the target multimedia information and the copyright information of the source multimedia information; and issue a warning message when the two are inconsistent.
  • the information processing module 2082 is further configured to: when it is determined that the target multimedia information is not similar to the source multimedia information, add the target multimedia information to the multimedia information source; sort the recall order of the multimedia information to be recommended in the multimedia information source; and recommend multimedia information to the target user based on the sorting result.
  • the information processing module 2082 is further configured to: send the identifier of the multimedia information, the audio feature vector corresponding to the audio in the multimedia information, and the copyright information of the multimedia information to the blockchain network, so that a node of the blockchain network fills them into a new block and, when consensus on the new block is reached, appends the new block to the end of the blockchain.
  • the information processing module 2082 is further configured to: receive data synchronization requests from other nodes in the blockchain network; in response to a data synchronization request, verify the permissions of the other node; and, when the permissions pass verification, control the current node to synchronize data with the other node so that the other node obtains the identifier of the multimedia information, the audio feature vector corresponding to the audio in the multimedia information, and the copyright information of the multimedia information.
  • the information processing module 2082 is further configured to: in response to a query request, parse the query request to obtain the corresponding object identifier; obtain the permission information in the target block in the blockchain network according to the object identifier; check whether the permission information matches the object identifier; when they match, obtain the identifier of the corresponding multimedia information, the audio feature vector corresponding to the audio in the multimedia information, and the copyright information of the multimedia information from the blockchain network; and send the obtained identifier, audio feature vector, and copyright information to the corresponding client.
  • the embodiments of the present application have at least the following technical effects: the mel spectrogram corresponding to the audio is determined, and the audio feature vector corresponding to the audio is determined according to the mel spectrogram, so that the similarity between pieces of multimedia information is determined effectively and the accuracy of the similarity judgment is improved.
  • when the multimedia information is video, this reduces the misjudgment of video similarity that arises when the judgment relies solely on the video image and the image has been excessively processed (e.g., excessively cropped).

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Computer Security & Cryptography (AREA)
  • Theoretical Computer Science (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Strategic Management (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Development Economics (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Databases & Information Systems (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Game Theory and Decision Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Human Resources & Organizations (AREA)
  • Tourism & Hospitality (AREA)
  • Primary Health Care (AREA)
  • Computing Systems (AREA)
  • Technology Law (AREA)
  • Bioethics (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

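The present application provides a multimedia information processing method and apparatus, an electronic device, and a storage medium. The method includes: parsing multimedia information to separate the audio in the multimedia information; converting the audio to obtain a mel spectrogram corresponding to the audio; determining, according to the mel spectrogram corresponding to the audio, an audio feature vector corresponding to the audio; and determining the similarity between target multimedia information and source multimedia information based on the audio feature vector corresponding to the source audio in the source multimedia information and the audio feature vector corresponding to the target audio in the target multimedia information.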

Description

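Multimedia information processing method and apparatus, electronic device, and storage medium
Cross-Reference to Related Applications
This application is based on, and claims priority to, Chinese patent application No. 202010956391.X filed on September 11, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to multimedia information processing technology, and in particular to a multimedia information processing method and apparatus, an electronic device, and a storage medium.
Background
In the related art, multimedia information takes many forms, demand for it is growing explosively, and its quantity and variety keep increasing.
Taking video as an example, as video editing tools have spread and developed, the kinds of attack on the video picture have become more complex, and excessive cropping makes similar videos increasingly hard to identify. For such cropped videos, relying on video image fingerprints alone makes it difficult to identify duplicate or infringing content whose picture has been changed substantially, so the accuracy of similarity identification is low.
Summary
The technical solutions of the embodiments of the present application are implemented as follows:
An embodiment of the present application provides a multimedia information processing method, executed by an electronic device, the method including:
parsing multimedia information to separate the audio in the multimedia information;
converting the audio to obtain a mel spectrogram corresponding to the audio;
determining, according to the mel spectrogram corresponding to the audio, an audio feature vector corresponding to the audio;
determining the similarity between target multimedia information and source multimedia information based on the audio feature vector corresponding to the source audio in the source multimedia information and the audio feature vector corresponding to the target audio in the target multimedia information.
An embodiment of the present application further provides a multimedia information processing apparatus, the apparatus including:
an information transmission module, configured to parse multimedia information to separate the audio in the multimedia information;
an information processing module, configured to:
convert the audio to obtain a mel spectrogram corresponding to the audio;
determine, according to the mel spectrogram corresponding to the audio, an audio feature vector corresponding to the audio;
determine the similarity between target multimedia information and source multimedia information based on the audio feature vector corresponding to the source audio in the source multimedia information and the audio feature vector corresponding to the target audio in the target multimedia information.
An embodiment of the present application further provides an electronic device, the electronic device including:
a memory, configured to store executable instructions;
a processor, configured to implement the multimedia information processing method provided by the embodiments of the present application when running the executable instructions stored in the memory.
An embodiment of the present application further provides a computer-readable storage medium storing executable instructions that, when executed by a processor, implement the multimedia information processing method provided by the embodiments of the present application.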
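Brief Description of the Drawings
FIG. 1 is a schematic diagram of the usage environment of a multimedia information processing method provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of the structure of an electronic device provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of excessive video cropping provided by an embodiment of the present application;
FIG. 4 is a schematic flowchart of the multimedia information processing method provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of the processing of the multimedia information processing model in an embodiment of the present application;
FIG. 6 is a schematic flowchart of similarity identification provided by an embodiment of the present application;
FIG. 7 is a schematic flowchart of training the multimedia information processing model provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of the effect of iterative processing in an embodiment of the present application;
FIG. 9 is a schematic diagram of the architecture of a blockchain network provided by an embodiment of the present application;
FIG. 10 is a schematic diagram of the structure of a blockchain in the blockchain network provided by an embodiment of the present application;
FIG. 11 is a schematic diagram of the functional architecture of the blockchain network provided by an embodiment of the present application;
FIG. 12 is a schematic diagram of a usage scenario of the multimedia information processing method provided by an embodiment of the present application;
FIG. 13 is a schematic diagram of the process of using the multimedia information processing method in an embodiment of the present application.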
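Detailed Description
To make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the drawings. The described embodiments should not be regarded as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the scope of protection of the present application.
The following description refers to "some embodiments", which describe a subset of all possible embodiments. It should be understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with one another where no conflict arises.
Before the embodiments of the present application are described in further detail, the nouns and terms involved in the embodiments are explained. The following explanations apply to them.
1) In response to: indicates the condition or state on which a performed operation depends. When the condition or state is satisfied, the one or more operations performed may be in real time or may have a set delay. Unless otherwise specified, there is no restriction on the order in which multiple operations are performed.
2) Multimedia information: refers broadly to the various forms of multimedia information available on the Internet. Multimedia information may include at least one of text, sound, and images, and is not limited to these. For example, multimedia information may be a long video (such as a user-uploaded video whose duration is greater than or equal to 1 minute), a short video (such as a user-uploaded video whose duration is less than 1 minute), or audio, such as a music video (MV) with a fixed picture, or a record.
3) Client: the carrier that implements a specific function in a terminal. For example, a mobile client (APP) is the carrier of a specific function in a mobile terminal, such as live streaming (video push streaming) or online video playback.
4) Short-time Fourier transform (STFT): a mathematical transform related to the Fourier transform, used to determine the frequency and phase of the sinusoids in local sections of a time-varying signal.
5) Mel spectrum (MBF, Mel Bank Features): the spectrogram obtained by processing audio (for example, by STFT) is large, so to obtain sound features of a suitable size the spectrogram can be passed through mel-scale filter banks to turn it into a mel spectrum. Here, the spectrogram is obtained by stacking individual spectra over time.
6) Information flow: a form of content organization arranged from top to bottom in a specific format. In terms of display ordering, time ordering, popularity ordering, or algorithmic ordering can be applied.
7) Audio feature vector: that is, the audio 01 vector, a binarized feature vector generated from audio.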
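FIG. 1 is a schematic diagram of the usage environment of the multimedia information processing method provided by an embodiment of the present application. Referring to FIG. 1, terminals (such as the terminal 10-1 and the terminal 10-2) are provided with clients capable of performing different functions. Through the business processes in a client, a terminal can obtain different multimedia information from the corresponding server 200 over the network 300 for browsing; the network 300 may be a wide area network, a local area network, or a combination of the two, and uses wireless links for data transmission. The type of multimedia information that a terminal obtains from the corresponding server 200 over the network 300 is not limited and includes, for example, long videos (such as a user-uploaded video whose duration is greater than or equal to 1 minute, or an existing video for which the user needs copyright verification), short videos (such as a user-uploaded video whose duration is less than 1 minute), and audio (such as an MV with a fixed picture, or a record). For example, a terminal may obtain a long video (that is, a video carrying video information or a corresponding video link) from the corresponding server 200 over the network 300, and may also obtain short videos for browsing from the corresponding server 400 over the network 300 through the same video client or a WeChat applet. The server 200 and the server 400 may store different types of multimedia information. The present application does not distinguish between the playback environments of different types of multimedia information. The multimedia information pushed to a user's client in this process should be copyright-compliant. Therefore, for a large volume of multimedia information, it is necessary to determine which items are similar and then check the copyright information of the similar items for compliance, so that duplicate or infringing multimedia information is not pushed.
Taking short video as an example, the embodiments of the present application can be applied to short video playback. In short video playback, different short videos from different data sources are usually processed, and the videos to be recommended to the corresponding user are finally presented on the user interface (UI). If a recommended video is a pirated video whose copyright is not compliant, the user experience suffers. The back-end database used for video playback receives a large amount of video data from different sources every day. The videos obtained for recommendation to a target user can also be called by other applications (for example, the recommendation results of a short video recommendation process can be migrated to a long video recommendation process or a news recommendation process). Likewise, the multimedia information processing model matched to the corresponding target user can be migrated to different multimedia information recommendation processes (for example, a web page recommendation process, an applet recommendation process, or the recommendation process of a long video client).
In some embodiments, the multimedia information processing method provided by the embodiments of the present application may be implemented by a terminal. For example, a terminal (such as the terminal 10-1 or the terminal 10-2) may implement the multimedia information processing solution locally.
In some embodiments, the multimedia information processing method provided by the embodiments of the present application may be implemented by a server. For example, the server 200 may implement the multimedia information processing solution.
In some embodiments, the multimedia information processing method provided by the embodiments of the present application may be implemented jointly by a terminal and a server. For example, a terminal (such as the terminal 10-1 or the terminal 10-2) may send a request to the server 200 asking it to implement the multimedia information processing solution. The server 200 may send the resulting multimedia information to be recommended to the terminal for recommendation.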
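The structure of the electronic device of the embodiments of the present application is described in detail below. The electronic device can be implemented in various forms. It may be a terminal with a multimedia information processing function, such as a mobile phone running a video client, in which case the trained multimedia information processing model can be packaged in the storage medium of the terminal. It may also be a server or server group with a multimedia information processing function, in which case the trained model can be deployed in the server, for example the server 200 in FIG. 1. FIG. 2 is a schematic diagram of the structure of the electronic device provided by an embodiment of the present application. It should be understood that FIG. 2 shows only an exemplary structure of the electronic device rather than the entire structure, and part or all of the structure shown in FIG. 2 may be implemented as required.
The electronic device provided by the embodiments of the present application may include at least one processor 201, a memory 202, a user interface 203, and at least one network interface 204. The components of the electronic device 20 are coupled together through a bus system 205, which implements connection and communication between these components. In addition to a data bus, the bus system 205 includes a power bus, a control bus, and a status signal bus. For clarity, however, all of these buses are labeled as the bus system 205 in FIG. 2.
The user interface 203 may include a display, a keyboard, a mouse, a trackball, a click wheel, keys, buttons, a touchpad, or a touchscreen.
It should be understood that the memory 202 may be a volatile memory or a non-volatile memory, and may also include both. The memory 202 in the embodiments of the present application can store data to support the operation of a terminal (such as the terminal 10-1 or the terminal 10-2). Examples of such data include any computer program that operates on the terminal, such as an operating system and application programs. The operating system contains various system programs, such as a framework layer, a core library layer, and a driver layer, used to implement basic services and to handle hardware-based tasks. The application programs may include various applications.
In some embodiments, the multimedia information processing apparatus provided by the embodiments of the present application may be implemented by a combination of software and hardware. As an example, the apparatus may be a processor in the form of a hardware decoding processor that is programmed to execute the multimedia information processing method provided by the embodiments of the present application. For example, such a processor may use one or more application-specific integrated circuits (ASIC), digital signal processors (DSP), programmable logic devices (PLD), complex programmable logic devices (CPLD), field-programmable gate arrays (FPGA), or other electronic components.
As an example of implementing the apparatus by a combination of software and hardware, the multimedia information processing apparatus provided by the embodiments of the present application may be embodied directly as a combination of software modules executed by the processor 201. The software modules may be located in a storage medium, the storage medium is located in the memory 202, and the processor 201 reads the executable instructions included in the software modules in the memory 202 and, in combination with the necessary hardware (for example, the processor 201 and other components connected to the bus system 205), completes the multimedia information processing method provided by the embodiments of the present application.
As an example, the processor 201 may be an integrated circuit chip with signal processing capability, such as a general-purpose processor, a DSP, another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, where the general-purpose processor may be a microprocessor or any conventional processor.
As an example of implementing the apparatus in hardware, the apparatus provided by the embodiments of the present application may be executed directly by the processor 201 in the form of a hardware decoding processor, for example by one or more ASICs, DSPs, PLDs, CPLDs, FPGAs, or other electronic components, to implement the multimedia information processing method provided by the embodiments of the present application.
The memory 202 in the embodiments of the present application is used to store various types of data to support the operation of the electronic device 20. Examples of such data include any executable instructions for operating on the electronic device 20; a program implementing the multimedia information processing method of the embodiments of the present application may be contained in the executable instructions.
In other embodiments, the multimedia information processing apparatus provided by the embodiments of the present application may be implemented in software. FIG. 2 shows the multimedia information processing apparatus 2020 stored in the memory 202, which may be software in the form of a program, a plug-in, or the like, and includes a series of modules. As an example of the programs stored in the memory 202, the multimedia information processing apparatus 2020 includes the following software modules: an information transmission module 2081 and an information processing module 2082. When the software modules in the multimedia information processing apparatus 2020 are read into RAM by the processor 201 and executed, the multimedia information processing method provided by the embodiments of the present application is implemented.
Based on the electronic device shown in FIG. 2, an embodiment of the present application further provides a computer program product or computer program that includes computer instructions (executable instructions) stored in a computer-readable storage medium. The processor of the electronic device reads the computer instructions from the computer-readable storage medium and executes them, causing the electronic device to perform the different embodiments, and combinations of embodiments, provided in the various optional implementations of the multimedia information processing method described above.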
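Before the multimedia information processing method provided by the present application is introduced, the defects of the related art are described first. Although existing video servers can roughly identify similarity between videos with matching algorithms, the kinds of attack on the video picture have become more complex as video editing tools have spread and developed. Refer to FIG. 3, a schematic diagram of excessive video cropping provided by an embodiment of the present application: for the cropped video shown in FIG. 3, relying on video image fingerprints alone makes it difficult to identify duplicate or infringing content whose picture has been changed substantially. In addition, as interactive video formats evolve, more and more video pictures share the same fixed background and only a small proportion of the picture is user-defined content, so two pictures can be highly similar overall while actually being different content, each of them original. In the prior art, an audio fingerprint algorithm can be used to compare the audio information in videos to determine whether the videos are similar. Taking the landmark audio fingerprint algorithm as an example, a Fourier transform is first applied to the audio to generate an audio spectrogram; a constellation map is then computed from the frequency peak points of the spectrogram; finally, the constellation map is processed to generate the combined hash landmark (t1, f1, f2, t2-t1), where t1 and t2 are two time points, f1 is the maximum frequency value at time t1, and f2 is the maximum frequency value at time t2. However, this process cannot identify content accurately in environments with heavy attack information. For example, under a pitch-shifting attack, landmark depends on the frequency peak points, and the pitch-shifted video changes the frequencies of the audio, so the generated hashes differ and similarity identification fails (that is, similar content is wrongly identified as dissimilar). Similarly, under a speed-up or slow-down attack, the combined hash in landmark depends on dt (t2-t1), and the dt of sped-up or slowed-down audio changes, so the generated hashes differ and similarity identification again fails.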
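The failure modes of the landmark fingerprint described above can be reproduced with a small sketch. The peak lists are made-up toy data, the `fan_out` pairing rule is an assumption, and matching on the key (f1, f2, t2-t1) while keeping t1 as an offset is the usual landmark convention rather than something the text specifies.

```python
def landmarks(peaks, fan_out=3):
    """Pair each spectrogram peak with the next few peaks.

    peaks: list of (time, frequency) tuples sorted by time.
    Returns landmark tuples (t1, f1, f2, t2 - t1).
    """
    result = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1 : i + 1 + fan_out]:
            result.append((t1, f1, f2, t2 - t1))
    return result


def match_keys(peaks, fan_out=3):
    """Hash keys used for matching: (f1, f2, dt), with the offset t1 dropped."""
    return {(f1, f2, dt) for _, f1, f2, dt in landmarks(peaks, fan_out)}


original = [(0, 100), (4, 220), (8, 150), (12, 300)]    # toy (time, frequency) peaks
pitch_shifted = [(t, f + 20) for t, f in original]      # every peak frequency moved
sped_up = [(t // 2, f) for t, f in original]            # 2x speed halves every dt

shared_after_pitch_shift = match_keys(original) & match_keys(pitch_shifted)
shared_after_speed_up = match_keys(original) & match_keys(sped_up)
```

In this toy case both intersections are empty: shifting the pitch changes f1 and f2 in every key, and changing the speed changes every dt, so no hash from the attacked audio matches the original.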
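To overcome the above defects, refer to FIG. 4, a schematic flowchart of the multimedia information processing method provided by an embodiment of the present application. It should be understood that the steps shown in FIG. 4 can be executed by any of various electronic devices running the multimedia information processing apparatus, for example a terminal, a server, or a server cluster with a multimedia information processing function. For instance, when the apparatus runs in a terminal, an applet in the terminal can be triggered to perform similarity detection (similarity identification) on multimedia information; when the apparatus runs in a long video copyright detection server or a music playback software server, copyright detection can be performed on the corresponding long video or audio information. The steps shown in FIG. 4 are described below.
Step 401: The multimedia information processing apparatus parses the multimedia information to separate the audio in the multimedia information.
Here, the multimedia information is obtained and parsed to obtain the audio in the multimedia information.
In some embodiments of the present application, parsing the multimedia information to separate the audio in the multimedia information can be implemented as follows:
The multimedia information is parsed to obtain its timing information; the video parameters corresponding to the multimedia information are parsed according to the timing information to obtain the playback duration parameter and the audio track information parameter corresponding to the multimedia information; and the audio is extracted from the multimedia information based on the playback duration parameter and the audio track information parameter.
Taking the case where the multimedia information is a video as an example, the audio synchronization packet in the video, which reflects the timing information, can be obtained first. The audio header decoding data AACDecoderSpecificInfo and the audio data configuration information AudioSpecificConfig in the audio synchronization packet are then parsed to obtain the playback duration parameter and the audio track information parameter of the video. The audio data configuration information AudioSpecificConfig is used to generate the ADTS header (including the sampling rate, number of channels, and frame length of the audio). The other audio packets in the video are obtained based on the audio track information parameter, and the raw audio data is parsed out. Finally, using the audio data header, the Advanced Audio Coding (AAC) decoder packs the AAC elementary stream (ES) into the ADTS format; for example, a 7-byte ADTS header can be added in front of the AAC ES stream, which completes the extraction of the audio from the multimedia information (such as a video).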
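As a hedged illustration of the last step, the sketch below builds a 7-byte ADTS header (no CRC) for one raw AAC frame. The function name and default arguments are hypothetical; the bit layout follows the ADTS fixed and variable header fields, with the profile field written as the audio object type minus one.

```python
# Sampling-frequency index table used by the ADTS sampling_frequency_index field.
SAMPLE_RATES = [96000, 88200, 64000, 48000, 44100, 32000,
                24000, 22050, 16000, 12000, 11025, 8000, 7350]


def adts_header(aac_frame_len, sample_rate=16000, channels=1, object_type=2):
    """Build a 7-byte ADTS header for one AAC frame of aac_frame_len bytes.

    object_type=2 is AAC LC; the defaults are illustrative assumptions.
    """
    freq_idx = SAMPLE_RATES.index(sample_rate)
    frame_len = aac_frame_len + 7                 # header bytes count toward the length
    header = bytearray(7)
    header[0] = 0xFF                              # syncword, high 8 bits
    header[1] = 0xF1                              # syncword low 4 bits, MPEG-4, layer 0, no CRC
    header[2] = ((object_type - 1) << 6) | (freq_idx << 2) | (channels >> 2)
    header[3] = ((channels & 0x3) << 6) | (frame_len >> 11)
    header[4] = (frame_len >> 3) & 0xFF
    header[5] = ((frame_len & 0x7) << 5) | 0x1F   # low length bits + buffer fullness (high bits)
    header[6] = 0xFC                              # buffer fullness (low bits), one raw data block
    return bytes(header)


def wrap_frame(aac_frame, **kwargs):
    """Prefix a raw AAC frame with its ADTS header."""
    return adts_header(len(aac_frame), **kwargs) + aac_frame
```

A player can locate frames in the resulting stream by the 0xFFF syncword and read the 13-bit frame length to find the next header.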
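Step 402: The multimedia information processing apparatus converts the audio to obtain a mel spectrogram corresponding to the audio.
Here, the audio is converted into a mel spectrogram. Frequency is measured in hertz (Hz), and the human ear can hear frequencies from 20 to 20000 Hz, but the ear does not perceive the Hz scale linearly. For example, if a listener has adapted to a 1000 Hz tone and the tone is raised to 2000 Hz, the ear perceives only a slight increase in frequency and does not perceive that the frequency has doubled. If the audio is converted into mel spectrogram data (that is, the frequency scale is converted to the mel frequency scale), the ear's perception of frequency becomes linear. In other words, on the mel frequency scale, if the mel frequencies of two audio segments differ by a factor of two, the pitches perceived by the ear also differ by roughly a factor of two. This has the beneficial technical effect of improving the perceptibility of the audio and giving the audio a concrete representation.
In some embodiments of the present application, converting the audio to obtain the mel spectrogram corresponding to the audio can be implemented as follows:
Channel conversion processing is performed on the audio to obtain mono audio data; a short-time Fourier transform is performed on the mono audio data based on a window function to obtain the corresponding spectrogram; and the spectrogram is processed according to the duration parameter to obtain the mel spectrogram corresponding to the audio.
For example, the audio can first be resampled (that is, the channel conversion processing) to 16 kHz mono audio data. A short-time Fourier transform is then applied to the mono audio data using a 25 ms periodic Hann window with a 10 ms frame shift to obtain the corresponding spectrogram. The spectrogram is mapped onto a 64-band mel filter bank to compute the mel spectrum, where the mel bins cover 125-7500 Hz. Next, log(mel-spectrum + 0.01) is computed to obtain a stabilized mel spectrum; the 0.01 offset avoids taking the logarithm of 0. Finally, these features are grouped into frames using a duration parameter of 0.96 s with no overlap between frames; each frame contains 64 mel bands and 96 sub-frames of 10 ms each. This yields the mel spectrogram corresponding to the audio. The window function and the duration parameter can be regarded as corresponding to the multimedia information processing model.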
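The parameters above can be turned into a NumPy sketch. It assumes the input is already 16 kHz mono (resampling is left out), uses a 512-point FFT, and uses the HTK-style mel formula 1127 ln(1 + f/700) with triangular filters; those three choices are assumptions, since the text only fixes the window, hop, band count, frequency range, log offset, and 0.96 s grouping.

```python
import numpy as np

SR = 16000                    # sample rate after resampling, Hz
WIN = int(0.025 * SR)         # 25 ms window -> 400 samples
HOP = int(0.010 * SR)         # 10 ms frame shift -> 160 samples
N_FFT = 512                   # assumed FFT size (next power of two above WIN)
N_MELS = 64
FMIN, FMAX = 125.0, 7500.0
FRAMES_PER_EXAMPLE = 96       # 0.96 s of 10 ms frames


def hz_to_mel(freq_hz):
    """HTK-style mel scale (an assumed choice of formula)."""
    return 1127.0 * np.log1p(np.asarray(freq_hz, dtype=float) / 700.0)


def mel_filterbank():
    """Triangular filters, shape (N_FFT // 2 + 1, N_MELS)."""
    fft_mels = hz_to_mel(np.linspace(0.0, SR / 2, N_FFT // 2 + 1))
    edges = np.linspace(hz_to_mel(FMIN), hz_to_mel(FMAX), N_MELS + 2)
    bank = np.zeros((N_FFT // 2 + 1, N_MELS))
    for m in range(N_MELS):
        lower, center, upper = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_mels - lower) / (center - lower)
        falling = (upper - fft_mels) / (upper - center)
        bank[:, m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank


def log_mel_examples(waveform):
    """Turn a 16 kHz mono waveform into an array of shape (N, 96, 64)."""
    waveform = np.asarray(waveform, dtype=float)
    if len(waveform) < WIN:
        raise ValueError("waveform is shorter than one analysis window")
    n_frames = 1 + (len(waveform) - WIN) // HOP
    index = np.arange(WIN)[None, :] + HOP * np.arange(n_frames)[:, None]
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(WIN) / WIN)  # periodic Hann
    magnitude = np.abs(np.fft.rfft(waveform[index] * window, n=N_FFT, axis=1))
    log_mel = np.log(magnitude @ mel_filterbank() + 0.01)  # offset avoids log(0)
    n_examples = n_frames // FRAMES_PER_EXAMPLE
    usable = log_mel[: n_examples * FRAMES_PER_EXAMPLE]
    return usable.reshape(n_examples, FRAMES_PER_EXAMPLE, N_MELS)
```

Two seconds of audio produce 198 STFT frames and therefore two complete 96-frame examples; the trailing frames that do not fill an example are dropped.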
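Step 403: The multimedia information processing apparatus determines, according to the mel spectrogram corresponding to the audio, the audio feature vector corresponding to the audio.
Here, the audio feature vector corresponding to the audio is determined according to the mel spectrogram corresponding to the audio. This audio feature vector reflects the characteristics of the audio accurately and effectively.
In some embodiments of the present application, determining the audio feature vector corresponding to the audio according to the mel spectrogram can be implemented as follows:
The corresponding input triplet samples are determined based on the mel spectrogram; the input triplet samples are processed alternately through the convolution layers and max pooling layers of the multimedia information processing model to obtain down-sampling results of the different input triplet samples; the down-sampling results are normalized through the fully connected layer of the multimedia information processing model to obtain normalized results; and depth decomposition processing is performed on the normalized results through the multimedia information processing model to obtain audio feature vectors that match the different input triplet samples.
As an example, refer to FIG. 5, a schematic diagram of the processing of the multimedia information processing model in an embodiment of the present application. Feature extraction in the model can be implemented with a VGGish network (named after the Visual Geometry Group, VGG): for example, the VGGish network extracts audio features from the mel spectrogram, and the extracted vectors are cluster-encoded with NetVLAD (Net Vector of Locally Aggregated Descriptors) to obtain the audio feature vector. NetVLAD can store the distance between each feature point and its nearest cluster center and use that distance as a new feature.
Continuing with the VGGish network as an example, the VGGish network supports extracting semantically meaningful 128-dimensional embedding feature vectors, that is, audio feature vectors, from the corresponding audio. During feature extraction, the audio is first converted into input triplet samples of the mel spectrogram, which serve as the input to the VGGish network. An example of the conversion is as follows: the spectrogram of the audio is computed from the signal magnitude and mapped onto a 64-band mel filter bank to compute the mel spectrogram, yielding N input triplet samples mapped from Hz to the mel spectrogram, with feature dimensions N*96*64. A TensorFlow-based VGGish network can be used as the audio feature extractor: the input triplet samples are fed into the VGGish network, which performs feature extraction and produces N*128 audio feature vectors.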
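Step 404: The multimedia information processing apparatus determines the similarity between the target multimedia information and the source multimedia information based on the audio feature vector corresponding to the source audio in the source multimedia information and the audio feature vector corresponding to the target audio in the target multimedia information.
In the embodiments of the present application, the similarity between two pieces of multimedia information can be determined from their audio feature vectors. For ease of distinction, the two are named the source multimedia information and the target multimedia information, the audio in the source multimedia information is named the source audio, and the audio in the target multimedia information is named the target audio. The audio feature vectors corresponding to the source audio and to the target audio can then be determined through steps 401 to 403.
Here, based on the audio feature vector corresponding to the source audio in the source multimedia information and the audio feature vector corresponding to the target audio in the target multimedia information, the similarity between the target multimedia information and the source multimedia information can be determined, which implements similarity identification of multimedia information.
Continuing with FIG. 6, a schematic flowchart of similarity identification provided by an embodiment of the present application, it should be understood that the steps shown in FIG. 6 can be executed by any of various electronic devices running the multimedia information processing apparatus, for example a terminal, a server, or a server cluster with a multimedia information processing function. When the apparatus runs in a terminal, an applet in the terminal can be triggered to perform similarity identification on multimedia information; when the apparatus runs in a short video copyright detection server or a music playback software server, copyright detection can be performed on the corresponding short video or audio. The steps shown in FIG. 6 are described below.
Step 601: Determine the corresponding inter-frame similarity parameter set based on the audio feature vector corresponding to the source audio in the source multimedia information and the audio feature vector corresponding to the target audio in the target multimedia information.
For example, the source audio can be divided into multiple audio frames and the target audio can likewise be divided into multiple audio frames, where the source audio and the target audio can use the same division standard (such as the duration of each audio frame). The audio frames divided from the source audio and those divided from the target audio are then combined pairwise (for example, by exhaustive pairwise combination) to obtain multiple audio frame pairs, each of which includes one audio frame from the source audio and one audio frame from the target audio.
For each audio frame pair, the inter-frame similarity between the two audio frames is determined from the audio feature vectors corresponding to the two audio frames. The inter-frame similarity parameter set is then built from all of the inter-frame similarities.
Step 602: Determine the number of audio frames in the inter-frame similarity parameter set that reach the similarity threshold.
Here, the number of audio frames whose inter-frame similarity reaches the similarity threshold is determined in the inter-frame similarity parameter set (the number of audio frames here may refer to the number of audio frame pairs).
Step 603: When the number of audio frames reaching the similarity threshold exceeds the count threshold, perform step 604; otherwise, perform step 605.
Here, the number of audio frames reaching the similarity threshold is compared with the count threshold. When the number exceeds the count threshold, the target multimedia information is determined to be similar to the source multimedia information; when the number does not exceed the count threshold, the target multimedia information is determined to be not similar to the source multimedia information.
Step 604: Determine that the target multimedia information is similar to the source multimedia information, and prompt for copyright information.
Here, when the target multimedia information is similar to the source multimedia information, there may be a risk of copyright infringement, so a prompt to provide copyright information can be issued. The copyright information requested may be that of at least one of the target multimedia information and the source multimedia information.
Step 605: Determine that the target multimedia information is not similar to the source multimedia information, and enter the corresponding recommendation process.
Here, when the target multimedia information is not similar to the source multimedia information, there is no risk of copyright infringement, so the corresponding recommendation process can be entered directly. The recommendation process can be used to recommend at least one of the target multimedia information and the source multimedia information.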
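Steps 601 to 605 can be sketched as follows. The text does not fix the inter-frame similarity measure or the threshold values, so cosine similarity and the default thresholds here are assumptions, and `is_similar` is a hypothetical name.

```python
import numpy as np


def is_similar(source_vecs, target_vecs, sim_threshold=0.9, count_threshold=3):
    """Decide similarity from per-frame audio feature vectors.

    source_vecs, target_vecs: arrays of shape (num_frames, dim).
    Returns (similar, matched_pairs).
    """
    src = np.asarray(source_vecs, dtype=float)
    tgt = np.asarray(target_vecs, dtype=float)
    # Normalize rows so that the dot product is the cosine similarity (assumed metric).
    src = src / np.maximum(np.linalg.norm(src, axis=1, keepdims=True), 1e-12)
    tgt = tgt / np.maximum(np.linalg.norm(tgt, axis=1, keepdims=True), 1e-12)
    # Step 601: inter-frame similarity parameter set over all source/target frame pairs.
    similarities = src @ tgt.T
    # Step 602: number of frame pairs that reach the similarity threshold.
    matched_pairs = int(np.count_nonzero(similarities >= sim_threshold))
    # Step 603: similar only if the count exceeds the count threshold.
    return matched_pairs > count_threshold, matched_pairs
```

A `True` result corresponds to step 604 (prompt for copyright information) and a `False` result to step 605 (enter the recommendation process).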
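In some embodiments of the present application, when it is determined that the target multimedia information is similar to the source multimedia information, the copyright information of the target multimedia information and the copyright information of the source multimedia information are obtained; the legality of the target multimedia information is determined from the two; and a warning message is issued when the copyright information of the target multimedia information is inconsistent with that of the source multimedia information.
Here, when the target multimedia information is determined to be similar to the source multimedia information, there may be a risk of copyright infringement. The copyright information of the target multimedia information and of the source multimedia information can therefore be obtained and used to determine the legality of the target multimedia information.
Taking the case where the source multimedia information is legal by default as an example, when the copyright information of the target multimedia information is consistent with that of the source multimedia information, the target multimedia information is determined to be legal; when the two are inconsistent, the target multimedia information is determined to be illegal. In addition, a warning message can be issued when the two are inconsistent.
The embodiments of the present application can equally determine the legality of the source multimedia information on the premise that the target multimedia information is legal by default.
In some embodiments of the present application, when it is determined that the target multimedia information is not similar to the source multimedia information, the target multimedia information is added to the multimedia information source; the recall order of the multimedia information to be recommended in the multimedia information source is sorted; and multimedia information is recommended to the target user based on the sorting result.
Here, when the target multimedia information is determined to be not similar to the source multimedia information, there is no risk of copyright infringement, so the target multimedia information can be added to the multimedia information source as multimedia information to be recommended; the source multimedia information can also be added to the multimedia information source. When multimedia information needs to be recommended, the recall order of the multimedia information to be recommended in the multimedia information source is sorted, and multimedia information is recommended to the target user based on the sorting result.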
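Refer to FIG. 7, a schematic flowchart of training the multimedia information processing model provided by an embodiment of the present application. It should be understood that the steps shown in FIG. 7 can be executed by any of various electronic devices running the multimedia information processing apparatus, for example a terminal, a server, or a server cluster with a multimedia information processing function. The multimedia information processing model can be trained before it is deployed, as described below with reference to the steps shown in FIG. 7.
Step 701: Obtain a first training sample set, where the first training sample set includes audio samples in the collected multimedia information.
Here, the first training sample set is obtained. It includes audio samples in video information that has been collected (for example, by a terminal), and may include at least one audio sample.
Step 702: Add noise to the first training sample set to obtain the corresponding second training sample set.
In some embodiments of the present application, adding noise to the first training sample set to obtain the corresponding second training sample set can be implemented as follows:
A dynamic noise type that matches the use environment of the multimedia information processing model is determined; noise is added to the first training sample set according to the dynamic noise type so as to change at least one of the background noise, volume, sampling rate, and sound quality of the audio samples in the first training sample set, obtaining the corresponding second training sample set.
In the embodiments of the present application, audio information attacks include, but are not limited to, attacks that change the audio frequency and attacks that change the video playback speed. Therefore, when the second training sample set is constructed, an audio augmentation data set can be built according to these audio attack types, where the audio augmentation forms (that is, the dynamic noise types) include, but are not limited to: pitch change, added background noise, volume change, sampling rate change, and sound quality change. Setting different augmentation forms yields different augmented audio. Note that in some embodiments of the present application, the construction of the second training sample set does not use cases in which the video duration changes or a frame shift leaves the frame pairs misaligned.
The second training sample set is built from the audio augmentation data set. For example, one original audio corresponds to 20 attack audios in the audio augmentation data set; each attack audio has the same duration as the original audio and no frame shift (that is, the audio at corresponding time points is the same). With an audio duration of dur and a step of 0.96 s, each group of audio (the original audio plus its attack audios) produces dur/0.96 labels, and segments at the same time point share the same label. The second training sample set can then be constructed from the attack audios and the corresponding labels.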
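Two of the augmentation forms above (added background noise and volume change) and the per-segment labelling can be sketched as follows. Gaussian noise at a target signal-to-noise ratio, the decibel gain, and the (clip id, segment index) label format are illustrative assumptions; pitch, sampling-rate, and sound-quality changes are not shown.

```python
import numpy as np

STEP_S = 0.96  # label step in seconds, as in the text


def add_background_noise(wave, snr_db, rng):
    """Add white Gaussian noise at the requested signal-to-noise ratio (assumed noise model)."""
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return wave + rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)


def change_volume(wave, gain_db):
    """Scale the amplitude by a gain given in decibels."""
    return wave * (10.0 ** (gain_db / 20.0))


def segment_labels(num_samples, clip_id, sample_rate=16000):
    """One label per complete 0.96 s segment; the label format is hypothetical."""
    step = int(round(STEP_S * sample_rate))
    return [(clip_id, k) for k in range(num_samples // step)]


def build_group(wave, clip_id, rng):
    """Pair the original audio and its attacked copies with shared per-segment labels."""
    variants = [
        wave,
        add_background_noise(wave, snr_db=10.0, rng=rng),
        change_volume(wave, gain_db=-6.0),
    ]
    labels = segment_labels(len(wave), clip_id)
    return [(variant, labels) for variant in variants]
```

Because every attacked copy keeps the original duration and alignment, segment k of each copy carries the same label as segment k of the original, which is what lets the original and its attacked copies serve as mutually similar samples during training.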
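Step 703: Process the second training sample set with the multimedia information processing model to determine initial parameters of the multimedia information processing model.
Step 704: In response to the initial parameters of the multimedia information processing model, process the second training sample set with the multimedia information processing model to determine update parameters of the multimedia information processing model.
In some embodiments of this application, processing the second training sample set with the multimedia information processing model in response to its initial parameters, so as to determine the update parameters, can be implemented as follows:
Substitute different audio samples from the second training sample set into the loss function corresponding to the triplet loss layer network of the multimedia information processing model; determine the parameters of the triplet loss layer network at which the loss function satisfies the corresponding convergence condition; and use those parameters as the update parameters of the multimedia information processing model.
Step 705: According to the update parameters of the multimedia information processing model, iteratively update the network parameters of the multimedia information processing model using the second training sample set.
For example, a convergence condition matching the triplet loss layer network in the multimedia information processing model can be determined, and the network parameters of the triplet loss layer network are iteratively updated until the corresponding loss function satisfies that convergence condition. Taking a multimedia information processing model that includes a VGGish network as an example, in the training stage the 128-dimensional vector produced by the VGGish network is fed into the triplet loss network (triplet-loss layer) of the model for training, so that similar audio ultimately yields similar embedding outputs (that is, audio feature vectors). The triplet loss is given by Formula 1:
L = max(d(a, p) - d(a, n) + margin, 0)    (Formula 1)
Here L denotes the triplet loss, a is a sample, p is a sample similar to a, and n is a sample of a different class from a (that is, not similar to a). d(a, p) is the distance between a and p in the vector space, and d(a, n) is defined likewise. Minimizing this loss function lets the model learn to distinguish similar samples from dissimilar ones.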
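Formula 1 can be written directly as code. The sketch below is for illustration only: the text does not fix the distance function d or the margin value, so Euclidean distance and a margin of 0.2 are assumptions, and the embeddings are plain Python lists rather than the 128-dimensional VGGish outputs.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed choice of d)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def triplet_loss(a, p, n, margin=0.2):
    """Formula 1: L = max(d(a, p) - d(a, n) + margin, 0)."""
    return max(euclidean(a, p) - euclidean(a, n) + margin, 0.0)
```

When the positive is already much closer to the anchor than the negative is, the loss is zero and the triplet contributes nothing to the update; otherwise the loss grows with the gap d(a, p) - d(a, n).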
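As an example, referring to FIG. 8, FIG. 8 is a schematic diagram of the effect of iterative processing in an embodiment of this application. The final optimization goal of the iterative processing shown in FIG. 8 is to shorten the distance between a and p and lengthen the distance between a and n. Three cases can arise:
1) easy triplets: L = 0, that is, d(a, p) + margin < d(a, n). This case needs no optimization: a is already close to p and far from n.
2) hard triplets: d(a, n) < d(a, p), that is, a is farther from p than from n.
3) semi-hard triplets: d(a, p) < d(a, n) < d(a, p) + margin, that is, n is farther from a than p is, but by less than the margin.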
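The three cases can be told apart from the two distances and the margin. The following sketch uses a hypothetical helper `classify_triplet`; how the exact boundary values (d(a, n) equal to d(a, p), or equal to d(a, p) + margin) are assigned is an assumption, since the inequalities in the text are strict.

```python
def classify_triplet(d_ap, d_an, margin):
    """Classify a triplet from d(a, p), d(a, n) and the margin."""
    if d_ap + margin < d_an:
        return "easy"       # negative is beyond the margin; loss is 0
    if d_an < d_ap:
        return "hard"       # negative is closer to the anchor than the positive
    return "semi-hard"      # negative is farther than the positive, but within the margin
```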
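In practical application scenarios, the amount of multimedia information keeps growing. Information related to the multimedia information can therefore be stored in a blockchain network or on a cloud server, supporting accurate determination of multimedia information similarity.
In some embodiments of this application, the identifier of the multimedia information, the audio feature vector corresponding to the audio in the multimedia information, and the copyright information of the multimedia information may also be sent to a blockchain network, so that a node of the blockchain network fills the identifier, the audio feature vector, and the copyright information into a new block and, when consensus is reached on the new block, appends the new block to the tail of the blockchain.
Some embodiments further include:
Receiving a data synchronization request from another node in the blockchain network; verifying the permission of the other node in response to the data synchronization request; and, when the permission of the other node passes verification, controlling data synchronization between the current node and the other node, so that the other node obtains the identifier of the multimedia information, the audio feature vector corresponding to the audio in the multimedia information, and the copyright information of the multimedia information.
Some embodiments further include: in response to a query request, parsing the query request to obtain the corresponding object identifier; obtaining, according to the object identifier, the permission information in a target block of the blockchain network; checking whether the permission information matches the object identifier; when they match, obtaining from the blockchain network the corresponding identifier of the multimedia information, the audio feature vector corresponding to the audio in the multimedia information, and the copyright information of the multimedia information; and sending the obtained identifier, audio feature vector, and copyright information to the corresponding client, so that the client obtains them.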
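Referring now to FIG. 9, FIG. 9 is a schematic architecture diagram of a blockchain network according to an embodiment of this application. It includes blockchain network 200 (which includes multiple consensus nodes; consensus node 210 is shown as an example in FIG. 9), authentication center 300, business entity 400, and business entity 500, each described below.
The type of blockchain network 200 is flexible; for example, it may be any one of a public chain, a private chain, or a consortium chain. Taking a public chain as an example, the electronic devices of any business entity, such as user terminals and servers, can access blockchain network 200 without authorization. Taking a consortium chain as an example, after a business entity obtains authorization, the electronic devices under its control (such as terminals or servers) can access blockchain network 200 and thereby become client nodes in blockchain network 200.
In some embodiments, a client node may act only as an observer of blockchain network 200, that is, it supports the business entity in initiating transactions (for example, to store data on the chain or to query on-chain data). The functions of consensus node 210 of blockchain network 200, such as ordering, consensus services, and ledger functions, may be implemented by the client node by default or selectively (for example, depending on the specific business needs of the business entity). In this way, the data and business processing logic of the business entity can be migrated to blockchain network 200 to the greatest extent, and blockchain network 200 makes the data and the business processing trustworthy and traceable.
The consensus nodes in blockchain network 200 receive transactions submitted by client nodes of different business entities (for example, business entity 400 and business entity 500 shown in the foregoing embodiments), such as client node 410 belonging to business entity 400 and client node 510 belonging to the database operator system shown in the foregoing embodiments. They execute the transactions to update or query the ledger, and the intermediate or final results of executing the transactions can be returned to the client nodes of the business entities for display.
For example, client nodes 410/510 can subscribe to events of interest in blockchain network 200, such as transactions occurring in a specific organization or channel in blockchain network 200. Consensus node 210 pushes the corresponding transaction notifications to client nodes 410/510, which triggers the corresponding business logic in client nodes 410/510.
The following describes an exemplary application of the blockchain network, taking as an example multiple business entities that access the blockchain network to manage information related to multimedia information.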
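Referring to FIG. 9, multiple business entities are involved in the management process; for example, business entity 400 may be a multimedia information processing apparatus and business entity 500 may be a display system with a multimedia information processing function. Each registers with authentication center 300 to obtain its own digital certificate. The digital certificate includes the public key of the business entity and a digital signature that authentication center 300 has signed over the public key and identity information of the business entity. The certificate is attached to a transaction together with the business entity's digital signature for that transaction and sent to the blockchain network, so that the blockchain network can extract the digital certificate and the signature from the transaction and verify the reliability of the message (that is, that it has not been tampered with) and the identity information of the business entity that sent it. The blockchain network then performs verification based on the identity, for example whether the entity has permission to initiate the transaction. A client running on any electronic device under the control of a business entity (such as a terminal or a server) can request access to blockchain network 200 and become a client node.
Client node 410 of business entity 400 is used to send the identifier of the multimedia information, the audio feature vector corresponding to the audio in the multimedia information, and the copyright information of the multimedia information to the blockchain network, so that a node of the blockchain network fills the identifier, the audio feature vector, and the copyright information into a new block and, when consensus is reached on the new block, appends the new block to the tail of the blockchain.
To send the identifier, the audio feature vector, and the copyright information to blockchain network 200, business logic can be set in client node 410 in advance. For example, when the target multimedia information is determined to be not similar to the source multimedia information, client node 410 automatically sends the identifier of the target multimedia information, the audio feature vector corresponding to the audio in the target multimedia information, and the copyright information of the target multimedia information to blockchain network 200. Alternatively, a staff member of business entity 400 can log in to client node 410, manually package the identifier, the audio feature vector, and the copyright information of the target multimedia information, and send the package to blockchain network 200. When sending, client node 410 generates a transaction corresponding to an update operation from the identifier, the audio feature vector, and the copyright information. The transaction specifies the smart contract that must be invoked to perform the update operation and the parameters passed to the smart contract. The transaction also carries the digital certificate of client node 410 and a signed digital signature (for example, obtained by encrypting a digest of the transaction with the private key corresponding to the digital certificate of client node 410), and is broadcast to consensus node 210 in blockchain network 200.
When consensus node 210 in blockchain network 200 receives the transaction, it verifies the digital certificate and digital signature carried by the transaction. After that verification succeeds, it confirms, based on the identity of business entity 400 carried in the transaction, whether business entity 400 has transaction permission. Failure of either the digital signature verification or the permission verification causes the transaction to fail. After successful verification, consensus node 210 signs its own digital signature (for example, obtained by encrypting a digest of the transaction with the private key of consensus node 210) and continues to broadcast the transaction in blockchain network 200.
After receiving a successfully verified transaction, consensus node 210 in blockchain network 200 fills the transaction into a new block and broadcasts the block. When a consensus node 210 in blockchain network 200 receives the broadcast new block, it runs the consensus process on the new block. If consensus succeeds, it appends the new block to the tail of the blockchain it stores, updates the state database according to the result of the transaction, and executes the transactions in the new block: for a submitted transaction that updates the identifier, audio feature vector, and copyright information of the multimedia information to be processed, a key-value pair containing the identifier, the audio feature vector, and the copyright information is added to the state database.
A staff member of business entity 500 logs in to client node 510 and enters a query request for the identifier of the multimedia information, the audio feature vector corresponding to the audio in the multimedia information, and the copyright information of the multimedia information. Client node 510 generates a transaction corresponding to an update operation or a query operation from the request. The transaction specifies the smart contract that must be invoked to perform the update or query operation and the parameters passed to the smart contract. The transaction also carries the digital certificate of client node 510 and a signed digital signature (for example, obtained by encrypting a digest of the transaction with the private key corresponding to the digital certificate of client node 510), and is broadcast to consensus node 210 in blockchain network 200.
When consensus node 210 in blockchain network 200 receives the transaction, it verifies the transaction, fills it into a block, and, once consensus is reached, appends the new block to the tail of the blockchain it stores, updates the state database according to the result of the transaction, and executes the transactions in the new block: for a submitted transaction that updates the copyright information of a given piece of multimedia information, the key-value pair corresponding to that copyright information is updated in the state database; for a submitted transaction that queries the copyright information of a given piece of multimedia information, the key-value pair corresponding to the identifier, audio feature vector, and copyright information of that multimedia information is looked up in the state database and the transaction result is returned.
It is worth noting that FIG. 9 shows, by way of example, the process of putting the identifier of the multimedia information, the audio feature vector corresponding to the audio in the multimedia information, and the copyright information of the multimedia information directly on the chain. In other embodiments, when this data is large, client node 410 can put the hashes of the identifier, the audio feature vector, and the copyright information on the chain in pairs, and store the identifier, the audio feature vector, and the copyright information themselves in a distributed file system or a database. After client node 510 obtains the identifier, the audio feature vector, and the copyright information from the distributed file system or database, it can verify them against the corresponding hashes in blockchain network 200, which reduces the workload of on-chain operations.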
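The hash-on-chain, data-off-chain variant can be sketched as below. This is a hedged illustration: SHA-256 and a sorted-key JSON serialization are assumptions (the text names neither the hash function nor the encoding), and `record_digest` and `verify_record` are hypothetical helper names rather than the API of any blockchain platform.

```python
import hashlib
import json

def record_digest(record):
    """Hash a record (identifier, audio feature vector, copyright info) deterministically."""
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

def verify_record(record, on_chain_digest):
    """Check a record fetched from off-chain storage against the digest stored on the chain."""
    return record_digest(record) == on_chain_digest
```

A client in the role of client node 410 would store `record_digest(record)` on the chain and the record itself off the chain; a client in the role of client node 510 would call `verify_record` after fetching the record.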
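As an example of a blockchain, referring to FIG. 10, FIG. 10 is a schematic structural diagram of the blockchain in blockchain network 200 according to an embodiment of this application. The header of each block may include both the hash of all transactions in that block and the hash of all transactions in the previous block. Records of newly generated transactions are filled into a block and, after consensus among the nodes of the blockchain network, appended to the tail of the blockchain, so that the chain grows. The hash-based chain structure between blocks ensures that the transactions in the blocks are tamper-proof and forgery-proof.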
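The block structure described for FIG. 10 can be sketched as a toy chain in which each header carries the hash of its own transactions and the hash of the previous block's transactions. This is a simplified illustration under assumptions: SHA-256 over a JSON serialization stands in for the unspecified hash, the field names are hypothetical, and consensus is not modeled.

```python
import hashlib
import json

def tx_hash(transactions):
    """Hash of all transactions in a block (assumed: SHA-256 over sorted-key JSON)."""
    data = json.dumps(transactions, sort_keys=True).encode("utf-8")
    return hashlib.sha256(data).hexdigest()

def make_block(transactions, prev_block=None):
    """Build a block whose header links to the previous block's transaction hash."""
    return {
        "transactions": transactions,
        "tx_hash": tx_hash(transactions),
        "prev_tx_hash": prev_block["tx_hash"] if prev_block else None,
    }

def chain_is_valid(chain):
    """Recompute every hash and check every link; any tampering breaks the chain."""
    for i, block in enumerate(chain):
        if block["tx_hash"] != tx_hash(block["transactions"]):
            return False
        expected_prev = chain[i - 1]["tx_hash"] if i > 0 else None
        if block["prev_tx_hash"] != expected_prev:
            return False
    return True
```

Changing a transaction in an earlier block changes its recomputed hash, so validation fails unless every later header is rewritten as well.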
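The following describes an exemplary functional architecture of the blockchain network provided in embodiments of this application. Referring to FIG. 11, FIG. 11 is a schematic diagram of the functional architecture of blockchain network 200 according to an embodiment of this application, including application layer 201, consensus layer 202, network layer 203, data layer 204, and resource layer 205, each described below.
Resource layer 205 encapsulates the computing, storage, and communication resources that implement each consensus node 210 in blockchain network 200.
Data layer 204 encapsulates the data structures that implement the ledger, including the blockchain implemented as files in a file system, a key-value state database, and proofs of existence (for example, a hash tree of the transactions in a block).
Network layer 203 encapsulates the peer-to-peer (P2P) network protocol, the data propagation and data verification mechanisms, the access authentication mechanism, and business entity identity management.
The P2P network protocol implements communication between consensus nodes 210 in blockchain network 200. The data propagation mechanism ensures that transactions propagate through blockchain network 200. The data verification mechanism uses cryptographic methods (such as digital certificates, digital signatures, and public/private key pairs) to make data transmission between consensus nodes 210 reliable. The access authentication mechanism authenticates the identity of a business entity joining blockchain network 200 according to the actual business scenario and, when authentication passes, grants the business entity permission to access blockchain network 200. Business entity identity management stores the identities and permissions (for example, the types of transactions that can be initiated) of the business entities allowed to access blockchain network 200.
Consensus layer 202 encapsulates the mechanism by which consensus nodes 210 in blockchain network 200 reach agreement on blocks (that is, the consensus mechanism), as well as transaction management and ledger management. The consensus mechanism includes consensus algorithms such as PoS, PoW, and DPoS, and supports pluggable consensus algorithms.
Transaction management verifies the digital signature carried in a transaction received by consensus node 210, verifies the identity information of the business entity, and determines from the identity information whether the entity has permission to perform the transaction (reading the relevant information from business entity identity management). Every business entity authorized to access blockchain network 200 holds a digital certificate issued by the authentication center, and signs the transactions it submits with the private key corresponding to its digital certificate, thereby declaring its legitimate identity.
Ledger management maintains the blockchain and the state database. A block on which consensus has been reached is appended to the tail of the blockchain, and the transactions in that block are executed: when a transaction includes an update operation, the key-value pair in the state database is updated; when a transaction includes a query operation, the key-value pair in the state database is queried and the result is returned to the client node of the business entity. Queries along multiple dimensions of the state database are supported, including: querying a block by block sequence number (for example, the hash of a transaction); querying a block by block hash; querying a block by transaction sequence number; querying a transaction by transaction sequence number; querying the account data of a business entity by the account (sequence number) of the business entity; and querying the blockchain in a channel by channel name.
Application layer 201 encapsulates the services that the blockchain network can provide, including transaction tracing, evidence storage, and verification.
Thus, the copyright information of multimedia information that has undergone similarity identification can be stored in the blockchain network. When a new user uploads multimedia information to the multimedia information server, the server can call the copyright information in the blockchain network to verify the copyright compliance of the uploaded multimedia information.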
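FIG. 12 is a schematic diagram of a usage scenario of the multimedia information processing method according to an embodiment of this application, using short videos as an example of multimedia information. A terminal (such as terminal 10-1 and terminal 10-2 shown in FIG. 1) is provided with a client of software capable of displaying short videos, for example a short video playback client or plug-in, through which a user can obtain and display short videos. The terminal connects to short video server 200 through network 300, which may be a wide area network, a local area network, or a combination of the two, and uses wireless links for data transmission. A user can also upload a short video through a mini program in the terminal for other users on the network to watch. In this process, the operator's video server needs to inspect the uploaded short video and compare and analyze different video information, for example to determine whether the copyright of the uploaded short video is compliant, and to recommend compliant videos to different users, thereby preventing users' short videos from being pirated.
The following describes how the multimedia information processing method provided in this application is used. Referring to FIG. 13, FIG. 13 is a schematic diagram of the usage process of the multimedia information processing method in an embodiment of this application; the description follows the steps shown in FIG. 13.
Step 1301: Obtain the audio in a video.
Here, the obtained video can be parsed to separate out its audio, and the audio can be preprocessed by a preprocessing procedure, for example to determine the Mel spectrogram corresponding to the audio.
Step 1302: Obtain a training sample set for the video information processing model (corresponding to the multimedia information processing model above).
Step 1303: Train the video information processing model and determine the corresponding model parameters (network parameters).
Here, the video information processing model is trained on the training sample set (such as the second training sample set above) to determine the corresponding model parameters.
Step 1304: Deploy the trained video information processing model in the corresponding video detection server.
Here, a video detection server on which the trained video information processing model is deployed can use the model to perform the relevant detection.
Step 1305: Use the video information processing model to inspect the audio in different videos and determine whether the videos are similar.
Taking short videos as an example, when a target short video is determined to be similar to a source video, the copyright information of the target short video is obtained, for example the copyright information uploaded by the user through the mini program running on terminal 10-1, or copyright information retrieved from its storage location in a cloud server network. The legality of the target short video is determined from the copyright information of the target short video and that of the source video. When the two are inconsistent, warning information is issued.
When the target short video is determined to be not similar to the source video, the target short video is added to the video source (corresponding to the multimedia information source above) as a to-be-recommended video. The recall order of all to-be-recommended videos in the video source is sorted, and videos are recommended to the target user based on the sorting result, which favors the promotion of original videos.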
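The following continues with an exemplary structure in which the multimedia information processing apparatus 2020 provided in embodiments of this application is implemented as software modules. In some embodiments, as shown in FIG. 2, the software modules of the multimedia information processing apparatus 2020 stored in memory 202 may include: information transmission module 2081, configured to parse multimedia information to separate out the audio in the multimedia information; and information processing module 2082, configured to convert the audio to obtain a Mel spectrogram corresponding to the audio, determine the audio feature vector corresponding to the audio according to the Mel spectrogram, and determine the similarity between target multimedia information and source multimedia information based on the audio feature vector corresponding to the source audio in the source multimedia information and the audio feature vector corresponding to the target audio in the target multimedia information.
In some embodiments, information transmission module 2081 is further configured to: parse the multimedia information to obtain its timing information; parse the video parameters corresponding to the multimedia information according to the timing information to obtain the playback duration parameter and audio track information parameter corresponding to the multimedia information; and extract the audio from the multimedia information based on the playback duration parameter and the audio track information parameter.
In some embodiments, information processing module 2082 is further configured to: perform channel conversion on the audio to obtain mono audio data; perform a short-time Fourier transform on the mono audio data based on a window function to obtain the corresponding spectrogram; and process the spectrogram according to a duration parameter to obtain the Mel spectrogram corresponding to the audio.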
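The preparation steps before the short-time Fourier transform can be sketched in plain Python. This is only a partial illustration under assumptions: averaging the two channels for the mono conversion, the Hann window, and the common mapping mel = 2595 * log10(1 + f / 700) are choices not stated in the text, all function names are hypothetical, and the Fourier transform and Mel filter bank themselves are left out.

```python
import math

def to_mono(stereo_frames):
    """Channel conversion: average (left, right) sample pairs into mono samples."""
    return [(left + right) / 2.0 for left, right in stereo_frames]

def hann_window(n):
    """Hann window of length n (n >= 2), one possible windowing function."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def frame_signal(samples, frame_len, hop):
    """Split samples into overlapping windowed frames, ready for a per-frame Fourier transform."""
    window = hann_window(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append([samples[start + i] * window[i] for i in range(frame_len)])
    return frames

def hz_to_mel(f_hz):
    """Map a frequency in Hz to the Mel scale (assumed formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)
```

Each windowed frame would then be transformed to a magnitude spectrum and pooled into Mel bands spaced evenly on the `hz_to_mel` scale to form one column of the Mel spectrogram.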
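In some embodiments, information processing module 2082 is further configured to: determine the corresponding input triplet samples based on the Mel spectrogram; process the input triplet samples alternately through the convolutional layers and max pooling layers of the multimedia information processing model to obtain downsampling results for the different input triplet samples; normalize the downsampling results through the fully connected layer of the multimedia information processing model to obtain normalized results; and perform deep factorization on the normalized results through the multimedia information processing model to obtain the audio feature vectors matching the different input triplet samples.
In some embodiments, information processing module 2082 is further configured to: obtain a first training sample set, where the first training sample set includes audio samples from collected video information; add noise to the first training sample set to obtain a corresponding second training sample set; process the second training sample set with the multimedia information processing model to determine initial parameters of the model; in response to the initial parameters, process the second training sample set with the model to determine update parameters of the model; and, according to the update parameters, iteratively update the network parameters of the model using the second training sample set.
In some embodiments, information processing module 2082 is further configured to: determine a dynamic noise type that matches the usage environment of the multimedia information processing model; and add noise to the first training sample set according to the dynamic noise type, so as to change at least one of the background noise, volume, sampling rate, and sound quality of the audio samples in the first training sample set, and obtain the corresponding second training sample set.
In some embodiments, information processing module 2082 is further configured to: substitute different audio samples from the second training sample set into the loss function corresponding to the triplet loss layer network of the multimedia information processing model; determine the parameters of the triplet loss layer network at which the loss function satisfies the corresponding convergence condition; and use those parameters as the update parameters of the multimedia information processing model.
In some embodiments, information processing module 2082 is further configured to: determine a convergence condition matching the triplet loss layer network in the multimedia information processing model; and iteratively update the network parameters of the triplet loss layer network until the corresponding loss function satisfies the convergence condition.
In some embodiments, information processing module 2082 is further configured to: determine the corresponding set of inter-frame similarity parameters based on the audio feature vector corresponding to the source audio in the source multimedia information and the audio feature vector corresponding to the target audio in the target multimedia information; determine the number of audio frames in the set that reach the similarity threshold; and determine the similarity between the target multimedia information and the source multimedia information based on that number.
In some embodiments, information processing module 2082 is further configured to: when the target multimedia information is determined to be similar to the source multimedia information, obtain the copyright information of the target multimedia information and of the source multimedia information; determine the legality of the target multimedia information from the two pieces of copyright information; and issue warning information when the two are inconsistent.
In some embodiments, information processing module 2082 is further configured to: when the target multimedia information is determined to be not similar to the source multimedia information, add the target multimedia information to the multimedia information source; sort the recall order of the to-be-recommended multimedia information in the multimedia information source; and recommend multimedia information to the target user based on the sorting result of the recall order.
In some embodiments, information processing module 2082 is further configured to: send the identifier of the multimedia information, the audio feature vector corresponding to the audio in the multimedia information, and the copyright information of the multimedia information to the blockchain network, so that a node of the blockchain network fills the identifier, the audio feature vector, and the copyright information into a new block and, when consensus is reached on the new block, appends the new block to the tail of the blockchain.
In some embodiments, information processing module 2082 is further configured to: receive a data synchronization request from another node in the blockchain network; verify the permission of the other node in response to the data synchronization request; and, when the permission of the other node passes verification, control data synchronization between the current node and the other node, so that the other node obtains the identifier of the multimedia information, the audio feature vector corresponding to the audio in the multimedia information, and the copyright information of the multimedia information.
In some embodiments, information processing module 2082 is further configured to: in response to a query request, parse the query request to obtain the corresponding object identifier; obtain, according to the object identifier, the permission information in a target block of the blockchain network; check whether the permission information matches the object identifier; when they match, obtain from the blockchain network the corresponding identifier of the multimedia information, the audio feature vector corresponding to the audio in the multimedia information, and the copyright information of the multimedia information; and send the obtained identifier, audio feature vector, and copyright information to the corresponding client, so that the client obtains them.
Embodiments of this application have at least the following technical effect: by determining the Mel spectrogram corresponding to the audio and determining the audio feature vector from the Mel spectrogram, the similarity between pieces of multimedia information can be determined accurately and effectively from the audio feature vectors, which improves the accuracy of multimedia information similarity determination. When the multimedia information is video, this reduces the misjudgment of video similarity that arises from relying on video images alone when the images have been heavily processed (for example, heavily cropped).
The foregoing is merely embodiments of this application and is not intended to limit the scope of protection of this application. Any modification, equivalent replacement, or improvement made within the spirit and principles of this application shall fall within the scope of protection of this application.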

Claims (17)

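  1. A multimedia information processing method, performed by an electronic device, the method comprising:
    parsing multimedia information to separate out audio in the multimedia information;
    converting the audio to obtain a Mel spectrogram corresponding to the audio;
    determining, according to the Mel spectrogram corresponding to the audio, an audio feature vector corresponding to the audio;
    determining a similarity between target multimedia information and source multimedia information based on an audio feature vector corresponding to source audio in the source multimedia information and an audio feature vector corresponding to target audio in the target multimedia information.
  2. The method according to claim 1, wherein the parsing multimedia information to separate out audio in the multimedia information comprises:
    parsing the multimedia information to obtain timing information of the multimedia information;
    parsing, according to the timing information of the multimedia information, video parameters corresponding to the multimedia information to obtain a playback duration parameter and an audio track information parameter corresponding to the multimedia information;
    extracting the audio from the multimedia information based on the playback duration parameter and the audio track information parameter corresponding to the multimedia information.
  3. The method according to claim 1, wherein the converting the audio to obtain a Mel spectrogram corresponding to the audio comprises:
    performing channel conversion on the audio to obtain mono audio data;
    performing a short-time Fourier transform on the mono audio data based on a window function to obtain a corresponding spectrogram;
    processing the spectrogram according to a duration parameter to obtain the Mel spectrogram corresponding to the audio.
  4. The method according to claim 1, wherein the determining, according to the Mel spectrogram corresponding to the audio, an audio feature vector corresponding to the audio comprises:
    determining corresponding input triplet samples based on the Mel spectrogram;
    processing the input triplet samples alternately through convolutional layers and max pooling layers of a multimedia information processing model to obtain downsampling results of the different input triplet samples;
    normalizing the downsampling results through a fully connected layer of the multimedia information processing model to obtain normalized results;
    performing deep factorization on the normalized results through the multimedia information processing model to obtain audio feature vectors matching the different input triplet samples.
  5. The method according to claim 4, wherein the method further comprises:
    obtaining a first training sample set, wherein the first training sample set comprises audio samples from collected video information;
    adding noise to the first training sample set to obtain a corresponding second training sample set;
    processing the second training sample set with the multimedia information processing model to determine initial parameters of the multimedia information processing model;
    in response to the initial parameters of the multimedia information processing model, processing the second training sample set with the multimedia information processing model to determine update parameters of the multimedia information processing model;
    iteratively updating, according to the update parameters of the multimedia information processing model, network parameters of the multimedia information processing model using the second training sample set.
  6. The method according to claim 5, wherein the adding noise to the first training sample set to obtain a corresponding second training sample set comprises:
    determining a dynamic noise type that matches a usage environment of the multimedia information processing model;
    adding noise to the first training sample set according to the dynamic noise type, so as to change at least one of background noise, volume, sampling rate, and sound quality of the audio samples in the first training sample set, and obtain the corresponding second training sample set.
  7. The method according to claim 5, wherein the processing, in response to the initial parameters of the multimedia information processing model, the second training sample set with the multimedia information processing model to determine update parameters of the multimedia information processing model comprises:
    substituting different audio samples in the second training sample set into a loss function corresponding to a triplet loss layer network of the multimedia information processing model;
    determining parameters of the triplet loss layer network at which the loss function satisfies a corresponding convergence condition;
    using the parameters of the triplet loss layer network as the update parameters of the multimedia information processing model.
  8. The method according to claim 5, wherein the iteratively updating, according to the update parameters of the multimedia information processing model, network parameters of the multimedia information processing model using the second training sample set comprises:
    determining a convergence condition matching a triplet loss layer network in the multimedia information processing model;
    iteratively updating network parameters of the triplet loss layer network until a loss function corresponding to the triplet loss layer network satisfies the convergence condition.
  9. The method according to claim 1, wherein the determining a similarity between target multimedia information and source multimedia information based on an audio feature vector corresponding to source audio in the source multimedia information and an audio feature vector corresponding to target audio in the target multimedia information comprises:
    determining a corresponding set of inter-frame similarity parameters based on the audio feature vector corresponding to the source audio in the source multimedia information and the audio feature vector corresponding to the target audio in the target multimedia information;
    determining a number of audio frames in the set of inter-frame similarity parameters that reach a similarity threshold;
    determining the similarity between the target multimedia information and the source multimedia information based on the number of audio frames that reach the similarity threshold.
  10. The method according to claim 1, wherein the method further comprises:
    when the target multimedia information is determined to be similar to the source multimedia information, obtaining copyright information of the target multimedia information and copyright information of the source multimedia information;
    determining legality of the target multimedia information from the copyright information of the target multimedia information and the copyright information of the source multimedia information;
    issuing warning information when the copyright information of the target multimedia information is inconsistent with the copyright information of the source multimedia information.
  11. The method according to claim 1, wherein the method further comprises:
    when the target multimedia information is determined to be not similar to the source multimedia information, adding the target multimedia information to a multimedia information source;
    sorting a recall order of to-be-recommended multimedia information in the multimedia information source;
    recommending multimedia information to a target user based on a sorting result of the recall order of the to-be-recommended multimedia information.
  12. The method according to any one of claims 1 to 11, wherein the method further comprises:
    sending an identifier of the multimedia information, the audio feature vector corresponding to the audio in the multimedia information, and copyright information of the multimedia information to a blockchain network, so that a node of the blockchain network fills the identifier of the multimedia information, the audio feature vector corresponding to the audio in the multimedia information, and the copyright information of the multimedia information into a new block and, when consensus is reached on the new block, appends the new block to the tail of a blockchain.
  13. The method according to claim 12, wherein the method further comprises:
    receiving a data synchronization request from another node in the blockchain network;
    verifying permission of the other node in response to the data synchronization request;
    when the permission of the other node passes verification, controlling data synchronization between a current node and the other node, so that the other node obtains the identifier of the multimedia information, the audio feature vector corresponding to the audio in the multimedia information, and the copyright information of the multimedia information.
  14. The method according to claim 12, wherein the method further comprises:
    in response to a query request, parsing the query request to obtain a corresponding object identifier;
    obtaining, according to the object identifier, permission information in a target block of the blockchain network;
    checking whether the permission information matches the object identifier;
    when the permission information matches the object identifier, obtaining from the blockchain network the corresponding identifier of the multimedia information, the audio feature vector corresponding to the audio in the multimedia information, and the copyright information of the multimedia information;
    sending the obtained identifier of the multimedia information, the audio feature vector corresponding to the audio in the multimedia information, and the copyright information of the multimedia information to a corresponding client, so that the client obtains the identifier of the multimedia information, the audio feature vector corresponding to the audio in the multimedia information, and the copyright information of the multimedia information.
  15. A multimedia information processing apparatus, the apparatus comprising:
    an information transmission module, configured to parse multimedia information to separate out audio in the multimedia information;
    an information processing module, configured to:
    convert the audio to obtain a Mel spectrogram corresponding to the audio;
    determine, according to the Mel spectrogram corresponding to the audio, an audio feature vector corresponding to the audio;
    determine a similarity between target multimedia information and source multimedia information based on an audio feature vector corresponding to source audio in the source multimedia information and an audio feature vector corresponding to target audio in the target multimedia information.
  16. An electronic device, the electronic device comprising:
    a memory, configured to store executable instructions;
    a processor, configured to implement the multimedia information processing method according to any one of claims 1 to 14 when running the executable instructions stored in the memory.
  17. A computer-readable storage medium storing executable instructions which, when executed by a processor, implement the multimedia information processing method according to any one of claims 1 to 14.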
PCT/CN2021/107117 2020-09-11 2021-07-19 Multimedia information processing method and apparatus, electronic device, and storage medium WO2022052630A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21865686.6A EP4114012A4 (en) 2020-09-11 2021-07-19 METHOD AND DEVICE FOR PROCESSING MULTIMEDIA INFORMATION AND ELECTRONIC DEVICE AND STORAGE MEDIUM
US17/962,722 US11887619B2 (en) 2020-09-11 2022-10-10 Method and apparatus for detecting similarity between multimedia information, electronic device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010956391.XA CN112104892B (zh) 2020-09-11 2020-09-11 Multimedia information processing method and apparatus, electronic device, and storage medium
CN202010956391.X 2020-09-11

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/962,722 Continuation US11887619B2 (en) 2020-09-11 2022-10-10 Method and apparatus for detecting similarity between multimedia information, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
WO2022052630A1 true WO2022052630A1 (zh) 2022-03-17

Family

ID=73750851

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/107117 WO2022052630A1 (zh) 2020-09-11 2021-07-19 Multimedia information processing method and apparatus, electronic device, and storage medium

Country Status (4)

Country Link
US (1) US11887619B2 (zh)
EP (1) EP4114012A4 (zh)
CN (1) CN112104892B (zh)
WO (1) WO2022052630A1 (zh)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112104892B (zh) 2020-09-11 2021-12-10 腾讯科技(深圳)有限公司 一种多媒体信息处理方法、装置、电子设备及存储介质
CN112380377B (zh) * 2021-01-14 2021-04-13 腾讯科技(深圳)有限公司 一种音频推荐方法、装置、电子设备及计算机存储介质
CN112597321B (zh) * 2021-03-05 2022-02-22 腾讯科技(深圳)有限公司 基于区块链的多媒体处理方法及相关设备
CN113435328B (zh) * 2021-06-25 2024-05-31 上海众源网络有限公司 视频片段处理方法、装置、电子设备及可读存储介质
CN113192520B (zh) * 2021-07-01 2021-09-24 腾讯科技(深圳)有限公司 一种音频信息处理方法、装置、电子设备及存储介质
CN113327631B (zh) * 2021-07-15 2023-03-21 广州虎牙科技有限公司 一种情感识别模型的训练方法、情感识别方法及装置
CN113626850B (zh) * 2021-10-13 2022-03-11 北京百度网讯科技有限公司 基于联盟链的请求处理方法、装置、设备和存储介质
CN114036341B (zh) * 2022-01-10 2022-03-29 腾讯科技(深圳)有限公司 音乐标签的预测方法、相关设备
CN114090962B (zh) * 2022-01-24 2022-05-13 湖北长江传媒数字出版有限公司 一种基于大数据的智能出版系统及方法
CN115278382B (zh) * 2022-06-29 2024-06-18 北京捷通华声科技股份有限公司 基于音频片段的视频片段确定方法及装置

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6928233B1 (en) * 1999-01-29 2005-08-09 Sony Corporation Signal processing method and video signal processor for detecting and analyzing a pattern reflecting the semantics of the content of a signal
US20090319513A1 (en) * 2006-08-03 2009-12-24 Nec Corporation Similarity calculation device and information search device
CN104091598A (zh) * 2013-04-18 2014-10-08 腾讯科技(深圳)有限公司 一种音频文件的相似计算方法及装置
CN106126617A (zh) * 2016-06-22 2016-11-16 腾讯科技(深圳)有限公司 一种视频检测方法及服务器
CN108520078A (zh) * 2018-04-20 2018-09-11 百度在线网络技术(北京)有限公司 视频识别方法和装置
CN109002529A (zh) * 2018-07-17 2018-12-14 厦门美图之家科技有限公司 音频检索方法及装置
CN110047510A (zh) * 2019-04-15 2019-07-23 北京达佳互联信息技术有限公司 音频识别方法、装置、计算机设备及存储介质
CN110991391A (zh) * 2019-09-17 2020-04-10 腾讯科技(深圳)有限公司 一种基于区块链网络的信息处理方法及装置
CN111581437A (zh) * 2020-05-07 2020-08-25 腾讯科技(深圳)有限公司 一种视频检索方法及装置
CN112104892A (zh) * 2020-09-11 2020-12-18 腾讯科技(深圳)有限公司 一种多媒体信息处理方法、装置、电子设备及存储介质

Family Cites Families (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5129053B2 (ja) * 2007-07-27 2013-01-23 パナソニック株式会社 コンテンツ再生装置、コンテンツ再生方法、コンテンツ再生プログラム及び集積回路
US9299364B1 (en) * 2008-06-18 2016-03-29 Gracenote, Inc. Audio content fingerprinting based on two-dimensional constant Q-factor transform representation and robust audio identification for time-aligned applications
US9286909B2 (en) * 2011-06-06 2016-03-15 Bridge Mediatech, S.L. Method and system for robust audio hashing
US20140161263A1 (en) * 2012-12-10 2014-06-12 Microsoft Corporation Facilitating recognition of real-time content
US20140172429A1 (en) * 2012-12-14 2014-06-19 Microsoft Corporation Local recognition of content
US9788777B1 (en) * 2013-08-12 2017-10-17 The Neilsen Company (US), LLC Methods and apparatus to identify a mood of media
US9491522B1 (en) * 2013-12-31 2016-11-08 Google Inc. Methods, systems, and media for presenting supplemental content relating to media content on a content interface based on state information that indicates a subsequent visit to the content interface
NL2012567B1 (en) * 2014-04-04 2016-03-08 Teletrax B V Method and device for generating improved fingerprints.
MX2017007165A (es) * 2014-12-01 2017-11-17 Inscape Data Inc Sistema y metodo para identificacion continua de segmentos de medios.
US11277390B2 (en) * 2015-01-26 2022-03-15 Listat Ltd. Decentralized cybersecure privacy network for cloud communication, computing and global e-commerce
US20170097992A1 (en) * 2015-10-02 2017-04-06 Evergig Music S.A.S.U. Systems and methods for searching, comparing and/or matching digital audio files
US10515292B2 (en) * 2016-06-15 2019-12-24 Massachusetts Institute Of Technology Joint acoustic and visual processing
AU2016422515A1 (en) * 2016-09-09 2019-02-21 Microsoft Technology Licensing, Llc. Tracing objects across different parties
CN110322886A (zh) * 2018-03-29 2019-10-11 北京字节跳动网络技术有限公司 一种音频指纹提取方法及装置
CA3045675A1 (en) * 2018-06-07 2019-12-07 Alexander Sheung Lai Wong System and method for decentralized digital structured data storage, management, and authentication using blockchain
US10885159B2 (en) * 2018-07-09 2021-01-05 Dish Network L.L.C. Content anti-piracy management system and method
US11947593B2 (en) * 2018-09-28 2024-04-02 Sony Interactive Entertainment Inc. Sound categorization system
WO2020153234A1 (ja) * 2019-01-23 2020-07-30 ソニー株式会社 情報処理システム、情報処理方法、およびプログラム
US11158013B2 (en) * 2019-02-27 2021-10-26 Audible Magic Corporation Aggregated media rights platform with media item identification across media sharing platforms
CN109903773B (zh) * 2019-03-13 2021-01-08 腾讯音乐娱乐科技(深圳)有限公司 音频处理方法、装置及存储介质
CN109913910B (zh) 2019-04-08 2020-12-08 北京科技大学 一种钛铁矿碳热-电解制备钛铁合金的方法
US10872170B2 (en) * 2019-05-15 2020-12-22 Advanced New Technologies Co., Ltd. Blockchain-based copyright distribution
US10904251B2 (en) * 2019-05-17 2021-01-26 Advanced New Technologies Co., Ltd. Blockchain-based copyright protection method and apparatus, and electronic device
CN110324726B (zh) * 2019-05-29 2022-02-18 北京奇艺世纪科技有限公司 模型生成、视频处理方法、装置、电子设备及存储介质
US11501787B2 (en) * 2019-08-22 2022-11-15 Google Llc Self-supervised audio representation learning for mobile devices
US11816151B2 (en) * 2020-05-15 2023-11-14 Audible Magic Corporation Music cover identification with lyrics for search, compliance, and licensing

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4114012A4 *

Also Published As

Publication number Publication date
EP4114012A4 (en) 2023-08-02
EP4114012A1 (en) 2023-01-04
US20230031846A1 (en) 2023-02-02
CN112104892A (zh) 2020-12-18
CN112104892B (zh) 2021-12-10
US11887619B2 (en) 2024-01-30

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21865686

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021865686

Country of ref document: EP

Effective date: 20220929

NENP Non-entry into the national phase

Ref country code: DE