WO2023070462A1 - File deduplication method, apparatus, and device - Google Patents

File deduplication method, apparatus, and device

Info

Publication number
WO2023070462A1
Authority
WO
WIPO (PCT)
Prior art keywords
file
storage space
data
files
storage
Prior art date
Application number
PCT/CN2021/127162
Other languages
English (en)
French (fr)
Inventor
郭小东
张海波
陈咸彰
黄永兵
刘铎
谭玉娟
Original Assignee
华为技术有限公司
重庆大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司, 重庆大学
Priority to PCT/CN2021/127162 priority Critical patent/WO2023070462A1/zh
Priority to CN202180103614.0A priority patent/CN118120212A/zh
Publication of WO2023070462A1 publication Critical patent/WO2023070462A1/zh

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/06 Protocols specially adapted for file transfer, e.g. file transfer protocol [FTP]

Definitions

  • The present application relates to the technical field of communications, and in particular to a file deduplication method, apparatus, and device.
  • Embodiments of the present application provide a file deduplication method, apparatus, and device.
  • The method can automatically remove duplicate files and reduce storage space occupation; it is transparent to applications and does not require users to perform complex operations, which reduces system processing overhead.
  • In a first aspect, an embodiment of the present application provides a file deduplication method, which is implemented by a terminal device or a device deployed on a cloud.
  • The terminal device or the device deployed on the cloud obtains a write request that includes a first file; in response to the write request, stores the first file in a first storage space; and determines whether a second file exists in a second storage space, where the second file is the same as the first file, and the second storage space and the first storage space are located at different layers of the storage system.
  • For example, the first storage space is located in the memory space, and the second storage space is located in the external storage space (such as a disk).
  • In this method, the first file included in the write request is stored in an independent storage space (the first storage space), and it is determined whether a file identical to the first file exists among the files already stored in the second storage space (that is, whether a duplicate file exists).
  • This method performs the duplicate check as write requests are obtained and thus realizes online deduplication (also known as online file deduplication), which is transparent to users and applications. Online file deduplication does not need to re-read files that have already been written to the external storage space (such as a disk) into the cache before deduplicating them, which reduces the number of repeated writes to the hard disk and avoids the hard disk write overhead caused by duplicate files. In addition, after the user turns on the file deduplication function, a duplicate check can be performed every time a write request is received, which saves the user from repeatedly performing deduplication manually and can improve user experience.
  • the file deduplication method provided in the first aspect may be applied to a scenario where an application program of a terminal device performs a write operation.
  • The terminal device obtains a write request of the application program, and the write request includes the first file; in response to the write request, the terminal device stores the first file in the first storage space and determines whether the second file exists in the second storage space, where the second file is the same as the first file, and the second storage space and the first storage space are located at different layers of the storage system.
  • the terminal device can implement online file deduplication during the writing operation performed by the application program, thereby reducing the storage space occupied.
  • The file deduplication process is transparent to the application, does not require cooperation from the application ecosystem within the terminal device, does not require the user to perform complicated operations, and has low system overhead.
  • The first storage space is used to perform the duplicate check, thereby realizing online file deduplication. The first storage space is compatible with the existing file cache space, which eases engineering implementation, and buffer operations such as space allocation are simplified and postponed, which helps reduce system operation overhead.
  • When the second file exists in the second storage space, the link identifier of the first file is associated with the second file, where the link identifier of the first file is used to obtain the first file, and the first file is then deleted from the first storage space.
  • The system can delete the duplicate file directly from the cache without generating additional data copies, which helps reduce system overhead; and because the link identifier of the deleted file is associated with the identical file in the second storage space, the file can still be found.
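  • The write path described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the class `TwoTierStore`, its dictionaries, and the use of a full SHA-256 digest as the feature information are all hypothetical stand-ins chosen for brevity.

```python
import hashlib


class TwoTierStore:
    """Toy model of a staging tier (first storage space) in front of a disk tier (second)."""

    def __init__(self):
        self.first_space = {}    # path -> bytes staged in memory
        self.second_space = {}   # storage address -> bytes "on disk"
        self.by_feature = {}     # feature information -> storage address
        self.links = {}          # link identifier (here: the path) -> storage address
        self._next_address = 0

    @staticmethod
    def feature(data: bytes) -> str:
        # Assumption: a full-content SHA-256 stands in for the feature information.
        return hashlib.sha256(data).hexdigest()

    def write(self, path: str, data: bytes) -> bool:
        """Handle one write request; return True if the file was a duplicate."""
        self.first_space[path] = data            # stage the file in the first space
        key = self.feature(data)
        address = self.by_feature.get(key)       # is an identical file already on disk?
        duplicate = address is not None
        if not duplicate:
            address = self._next_address         # no duplicate: persist to the second space
            self._next_address += 1
            self.second_space[address] = data
            self.by_feature[key] = address
        self.links[path] = address               # the link identifier resolves to stored data
        del self.first_space[path]               # the staged copy is no longer needed
        return duplicate

    def read(self, path: str) -> bytes:
        return self.second_space[self.links[path]]
```

  • In this sketch a duplicate write never reaches the second tier: only the link table changes, which mirrors the point that no additional data copy is generated.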
  • the second file is the same as the first file, which means that the characteristic information of the second file is the same as the characteristic information of the first file.
  • the feature information of the first file is determined according to the sampling data of the first file.
  • the sampled data is partial data obtained from the data of the first file through a sampling algorithm.
  • the feature information of the first file is determined according to the sampling data and file information of the first file.
  • the file information includes information such as file type and file size.
  • the feature information includes fingerprint information and/or file identification ID.
  • The feature information of a file is unique to that file.
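  • As a hedged illustration of deriving feature information from sampled data plus file information, the sketch below hashes the file type, the file size, and three sampled chunks. The sampling positions (head, middle, tail), the 4096-byte chunk size, and the function names `sample_bytes` and `feature_info` are assumptions; the source does not specify the sampling algorithm.

```python
import hashlib


def sample_bytes(data: bytes, chunk: int = 4096) -> bytes:
    """Return head, middle and tail chunks of the data (hypothetical sampling algorithm)."""
    if len(data) <= 3 * chunk:
        return data                      # small files are used in full
    mid = len(data) // 2 - chunk // 2
    return data[:chunk] + data[mid:mid + chunk] + data[-chunk:]


def feature_info(data: bytes, file_type: str) -> str:
    """Combine file information (type, size) with sampled data into one fingerprint."""
    h = hashlib.sha256()
    h.update(file_type.encode("utf-8"))
    h.update(b"\0")                      # separator so fields cannot run together
    h.update(str(len(data)).encode("ascii"))
    h.update(b"\0")
    h.update(sample_bytes(data))
    return h.hexdigest()
```

  • Hashing only a sample keeps the cost independent of file size, but in this sketch two files of equal type and size that differ only outside the sampled regions would receive the same fingerprint, so a real system would need some additional safeguard to make the feature information unique.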
  • the feature information of the first file is determined in response to an instruction to close the first file.
  • the process of determining the feature information of the file can be performed during the file closing operation after the writing operation is completed, which is beneficial to reduce system overhead.
  • The feature information of the first file is determined; according to the feature information of the first file, it is determined whether a third file exists in the index directory, where the file name of the third file is the same as the feature information of the first file, and the third file is associated with the storage address of the second file.
  • the index directory can be updated so that the index directory includes files that have been written to the disk, which is beneficial to more accurately judging whether there is a duplicate file in the system.
  • prompt information is generated, and the prompt information includes one or more of the following: a reminder of deleted duplicate files, storage capacity released by deleting duplicate files, number of deleted duplicate files, and file types of duplicate files.
  • A record log is generated, and the record log includes one or more of the following: data in the index directory, the storage location corresponding to the first file identifier, data in the first storage space, the storage capacity freed by deleting duplicate files, the number of deleted duplicate files, and the file types of the deleted duplicate files.
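  • The statistics named in the prompt information and record log could be accumulated as in the sketch below. The class `DedupReport`, its fields, and the message wording are hypothetical; the source only lists which quantities may be reported.

```python
from collections import Counter


class DedupReport:
    """Accumulates the counters that the prompt information and record log may include."""

    def __init__(self):
        self.freed_bytes = 0            # storage capacity released by deleting duplicates
        self.removed_count = 0          # number of deleted duplicate files
        self.removed_types = Counter()  # file type -> number of deleted duplicates

    def record(self, file_type: str, size: int) -> None:
        """Register one deleted duplicate file."""
        self.freed_bytes += size
        self.removed_count += 1
        self.removed_types[file_type] += 1

    def prompt(self) -> str:
        """Build a one-line user-facing summary (wording is illustrative only)."""
        types = ", ".join(sorted(self.removed_types))
        return (f"Removed {self.removed_count} duplicate file(s), "
                f"freed {self.freed_bytes} bytes; types: {types}")
```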
  • an instruction is obtained, which indicates enabling the file deduplication function; in response to the instruction, an operation of obtaining a write request is performed.
  • a file deduplication function switch can be provided to the user, and the user only needs to turn on the switch to realize automatic file deduplication, and the user does not need to participate in the file deduplication process, which optimizes user experience.
  • the overall process of implementing the file deduplication method in the first aspect may be embedded in the main process of the file access process.
  • In a second aspect, an embodiment of the present application provides a file search method, which is implemented by a terminal device or a device deployed on a cloud.
  • The terminal device or the device deployed on the cloud obtains the first file and determines the feature information of the first file; according to the feature information of the first file, it determines whether a third file exists in the index directory, where the file name of the third file is the same as the feature information of the first file, the third file is associated with the storage address of the second file, and the second file is stored in the second storage space.
  • The index directory is constructed in the form of files: the third file in the index directory corresponds to the second file stored in the second storage space, the feature information of the second file is used as the file name of the third file, and the third file is associated with the storage address of the second file; for example, the storage address of the second file may be stored in the third file.
  • An index directory stored in the form of files requires little storage space, which greatly reduces storage overhead, and lookups in such an index directory are faster than in the prior art, which can greatly improve system performance.
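  • A file-based index directory of this kind can be sketched with ordinary file system calls. In the sketch below, each index entry is a small file whose name is the feature information and whose content is the storage address; the class name `IndexDirectory` and its methods are hypothetical, and the feature string is assumed to be safe to use as a file name (for example, a hexadecimal digest).

```python
from pathlib import Path
from typing import Optional


class IndexDirectory:
    """Index kept as files: entry name = feature information, entry content = storage address."""

    def __init__(self, root: Path):
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def lookup(self, feature: str) -> Optional[str]:
        """Return the storage address recorded for this feature, or None if no entry exists."""
        entry = self.root / feature
        if entry.is_file():
            return entry.read_text(encoding="utf-8")
        return None

    def add(self, feature: str, storage_address: str) -> None:
        """Add an entry for a file that has just been written to the second storage space."""
        (self.root / feature).write_text(storage_address, encoding="utf-8")
```

  • A lookup here is a single name resolution in one directory, so the sketch delegates the search to the file system's own directory lookup rather than maintaining a separate in-memory table.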
  • the characteristic information of the first file is determined according to the sampling data of the first file; wherein, the sampling data is partial data obtained from the data of the first file through a sampling algorithm.
  • When the third file does not exist in the index directory, the first file is stored in the second storage space, and a fourth file is added to the index directory, where the file name of the fourth file is the feature information of the first file, and the fourth file is associated with the storage address of the first file.
  • the link identifier of the first file is associated with the storage address of the second file, and the link identifier of the first file is used to obtain the first file.
  • prompt information is generated, and the prompt information includes one or more of the following: a reminder of deleted duplicate files, storage capacity released by deleting duplicate files, number of deleted duplicate files, and file types of duplicate files.
  • A record log is generated, and the record log includes one or more of the following: data in the index directory, the storage location corresponding to the first file identifier, data in the first storage space, the storage capacity freed by deleting duplicate files, the number of deleted duplicate files, and the file types of the deleted duplicate files.
  • an instruction is obtained, which indicates enabling the file deduplication function; in response to the instruction, an operation of obtaining a write request is performed.
  • a file deduplication function switch can be provided to the user, and the user only needs to turn on the switch to realize automatic file deduplication, and the user does not need to participate in the file deduplication process, which optimizes user experience.
  • the overall process of executing the file search method of the second aspect may be embedded in the main flow of the file access process.
  • In a third aspect, an embodiment of the present application provides a file deduplication apparatus, which includes a file operation module, a file cache module, and an information processing module.
  • The file operation module is used to obtain the write request, and the write request includes the first file; the file cache module is used to store the first file in the first storage space in response to the write request; the information processing module is used to determine whether a second file exists in the second storage space, where the second file is the same as the first file, and the second storage space and the first storage space are located at different layers of the storage system.
  • The file cache module is further configured to store the first file in the third storage space when the second file does not exist, and perform a buffer operation on the first file in the third storage space; after the buffer operation is completed, the first file is stored in the second storage space.
  • The file cache module is further configured to perform a buffer operation on the first file in the second storage space when the second file does not exist; after the buffer operation is performed, the first file is stored in the second storage space.
  • The information processing module is further configured to associate the link identifier of the first file with the second file when the second file exists, where the link identifier of the first file is used to obtain the first file.
  • The file cache module is also used to delete the first file from the first storage space.
  • the feature information includes fingerprint information and/or file ID.
  • The feature information of a file is unique to that file.
  • the information processing module is further configured to determine the characteristic information of the first file according to the sampled data of the first file, where the sampled data is part of the data obtained from the data of the first file through a sampling algorithm.
  • The information processing module is also used to determine the feature information of the first file and, according to the feature information of the first file, determine whether a third file exists in the index directory, where the file name of the third file is the same as the feature information of the first file, and the third file is associated with the storage address of the second file in the second storage space.
  • The file deduplication apparatus also includes a prompt module, which is used to generate prompt information, and the prompt information includes one or more of the following: a reminder of deleted duplicate files, the storage capacity released by deleting duplicate files, the number of deleted duplicate files, and the file types of the duplicate files.
  • The file deduplication apparatus further includes a generation module, which is used to generate a record log, and the record log includes one or more of the following: data in the index directory, the storage location corresponding to the first file identifier, data in the first storage space, the storage capacity freed by deleting duplicate files, the number of deleted duplicate files, and the file types of the deleted duplicate files.
  • the device for deduplication of files further includes an execution module, and the execution module is configured to acquire an instruction indicating to enable the function of deduplication of files; in response to the instruction, perform an operation of acquiring a write request.
  • the module for implementing the file deduplication method provided in the above third aspect and any possible design thereof can also realize the beneficial effects of the file deduplication method provided in the first aspect.
  • In a fourth aspect, an embodiment of the present application provides a file search apparatus, which includes a file operation module and an information processing module.
  • The file operation module is used to obtain the first file; the information processing module is used to determine the feature information of the first file; the file operation module is also used to determine, according to the feature information of the first file, whether a third file exists in the index directory, where the file name of the third file is the same as the feature information of the first file, and the third file is associated with the storage address of the second file in the second storage space.
  • The information processing module being used to determine the feature information of the first file includes: determining the feature information of the first file according to the sampling data of the first file, where the sampling data is partial data obtained from the data of the first file through a sampling algorithm.
  • The file search apparatus further includes a file cache module, which is used to store the first file in the second storage space when the third file does not exist in the index directory, and to add a fourth file to the index directory, where the file name of the fourth file is the feature information of the first file, and the fourth file is associated with the storage address of the first file.
  • The information processing module is further configured to associate the link identifier of the first file with the second file when the third file exists in the index directory, where the link identifier of the first file is used to obtain the first file; the file cache module is also used to delete the first file from the first storage space.
  • the module for implementing the file search method provided in the above fourth aspect and any possible design thereof can also realize the beneficial effects of the file search method provided in the second aspect.
  • the embodiment of the present application provides a device, and the device may be a terminal device or a device deployed on a cloud.
  • the device includes one or more processors and memory; the memory is coupled with one or more processors, and the memory stores a computer program, and when the one or more processors execute the computer program, the device performs the following operations:
  • It is determined whether a second file exists in the second storage space, where the second file is the same as the first file, and the first storage space and the second storage space are located at different layers of the storage system.
  • For the introduction of the first storage space, the second storage space, the sampling data of the first file, the feature information of the first file, the association of the link identifier of the first file with the second file, the generation of prompt information, and the generation of record logs, refer to the corresponding descriptions in the first aspect; details are not repeated here.
  • the embodiment of the present application provides a device, and the device may be a terminal device or a device deployed on a cloud.
  • the device includes one or more processors and memory; the memory is coupled with one or more processors, and the memory stores a computer program, and when the one or more processors execute the computer program, the device performs the following operations:
  • The file name of the third file is the same as the feature information of the first file, and the third file is associated with the storage address of the second file in the second storage space.
  • An embodiment of the present application provides a computer-readable storage medium that stores a computer program, and when the computer program is executed by a processor, the method described in the first aspect or the second aspect or any possible implementation thereof is implemented.
  • An embodiment of the present application provides a chip system. The chip system includes a processor and may also include a memory, and is used to implement the functions of the terminal device or the device deployed on the cloud in the method described in the first aspect or the second aspect.
  • the system-on-a-chip may consist of chips, or may include chips and other discrete devices.
  • An embodiment of the present application provides a computer program product including instructions that, when run on a computer, cause the computer to execute the method described in the first aspect or the second aspect or any possible implementation thereof.
  • FIG. 1a is a schematic flow diagram of a user manually performing a file deduplication function
  • FIG. 1b is a schematic diagram of a file abnormality after the user manually performs the file deduplication function
  • FIG. 2 is a schematic diagram of a hardware structure of a terminal device provided in an embodiment of the present application
  • FIG. 3 is a schematic diagram of a software structure of a terminal device provided in an embodiment of the present application.
  • FIG. 4a is a modular flow chart for implementing a file deduplication method provided by the embodiment of the present application
  • FIG. 4b is another modular flow chart for implementing a file deduplication method provided by the embodiment of the present application.
  • FIG. 5 is a schematic diagram of an index directory provided by the embodiment of the present application.
  • FIG. 6 is a schematic flow diagram of implementing a file deduplication function for an application program in an Android system terminal provided by an embodiment of the present application;
  • FIG. 7a is a schematic diagram of a process for performing a write operation in the first storage space provided by an embodiment of the present application.
  • FIG. 7b is a schematic diagram of another process for performing a write operation in the first storage space provided by the embodiment of the present application.
  • FIG. 8 is a schematic diagram of determining feature information based on sampled data provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of associating a link identifier of a file with the same file provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of a link correspondence relationship provided by the embodiment of the present application.
  • FIG. 11 is a schematic diagram of an output file access authorization interface provided by the embodiment of the present application.
  • FIG. 12 is a schematic diagram of an external device calling a file deduplication function provided by an embodiment of the present application.
  • FIG. 13 is a schematic flowchart of a file deduplication method provided in the embodiment of the present application.
  • FIG. 14 is a schematic flowchart of a file search method provided in the embodiment of the present application.
  • FIG. 15 is a schematic diagram of a device provided by an embodiment of the present application.
  • FIG. 16 is a schematic diagram of a file deduplication device provided in the embodiment of the present application.
  • FIG. 17 is a schematic diagram of a file search device provided by an embodiment of the present application.
  • Words such as “exemplary” or “for example” are used to denote examples, illustrations, or explanations, and any embodiment or design described as “exemplary” or “for example” should not be interpreted as more preferred or more advantageous than other embodiments or designs.
  • the use of words such as “exemplary” or “for example” is intended to present related concepts in a specific manner for easy understanding.
  • the mobile phone cleaning tool can provide a user entry, and the user can scan and identify duplicate files in the terminal device after manual activation, obtain the scanning result, and provide the scanning result to the user. Users manually confirm and delete duplicate files one by one.
  • FIG. 1a shows a process when a user manually performs a file deduplication function.
  • the display interface of the terminal device will display information such as storage space occupied by the system, junk files, duplicate files, etc. at present.
  • FIG. 1b shows a situation where a file is abnormal after the user manually performs the file deduplication function. Because the duplicate files are deleted outright during cleaning, when the user opens the social software interaction window to look for a picture again, the interaction window cannot display the original picture normally.
  • API: application programming interface
  • APFS: Apple File System
  • APFS has a copy-on-write feature. If a user copies a file stored on APFS to another folder on the same APFS file system, APFS creates a new file marked "copy-on-write" that points to all of the original file's storage.
  • APFS does not try to determine whether an existing file or a file copied from an external source matches any file already on the file system.
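  • The copy-on-write behaviour described above can be illustrated with a toy model. The sketch below is not APFS's actual implementation or API; `CowVolume` and its methods are hypothetical names, and a whole file is treated as a single block for simplicity.

```python
class CowVolume:
    """Toy copy-on-write volume: a clone shares storage until one side is rewritten."""

    def __init__(self):
        self.blocks = {}   # block id -> bytes
        self.refs = {}     # block id -> number of files pointing at the block
        self.files = {}    # file name -> block id
        self._next = 0

    def _alloc(self, data: bytes) -> int:
        bid = self._next
        self._next += 1
        self.blocks[bid] = data
        self.refs[bid] = 1
        return bid

    def create(self, name: str, data: bytes) -> None:
        """Write a new file (assumes the name is not in use); no duplicate check is made."""
        self.files[name] = self._alloc(data)

    def clone(self, src: str, dst: str) -> None:
        """Copy within the volume: the new file points at the original storage."""
        bid = self.files[src]
        self.refs[bid] += 1
        self.files[dst] = bid

    def write(self, name: str, data: bytes) -> None:
        """Rewrite a file: it gets fresh storage, and shared storage stays intact."""
        old = self.files[name]
        self.refs[old] -= 1
        if self.refs[old] == 0:
            del self.blocks[old]
            del self.refs[old]
        self.files[name] = self._alloc(data)
```

  • In this model an explicit clone shares one block, whereas creating a second file with identical content allocates a second block, which corresponds to the limitation that files from an external source are not matched against existing files.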
  • This solution needs to provide an API and requires the application ecosystem to be modified to cooperate with it, which greatly limits the application scenarios.
  • An embodiment of the present application provides a file deduplication method that can effectively remove duplicate files and reduce storage space occupation; when the file deduplication method is applied to a terminal device, it is transparent to applications and does not require users to perform complex operations, which reduces the processing overhead of the system.
  • the file deduplication method provided in the embodiment of the present application can be applied to a terminal device, or deployed in a device on the cloud.
  • the method for deduplication of files may also be applied to a scenario of deduplication of files on the cloud controlled by a terminal device.
  • the exemplary terminal devices provided in the following embodiments of the present application are firstly introduced below.
  • FIG. 2 shows a schematic structural diagram of a terminal device 100 .
  • The terminal device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone jack 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and the like.
  • the sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, bone conduction sensor 180M, etc.
  • the structure shown in the embodiment of the present application does not constitute a specific limitation on the terminal device 100 .
  • the terminal device 100 may include more or fewer components than shown in the figure, or combine certain components, or separate certain components, or arrange different components.
  • the illustrated components can be realized in hardware, software or a combination of software and hardware.
  • The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU). Different processing units may be independent devices or may be integrated in one or more processors.
  • the controller can generate an operation control signal according to the instruction opcode and timing signal, and complete the control of fetching and executing the instruction.
  • a memory may also be provided in the processor 110 for storing instructions and data.
  • the memory in processor 110 is a cache memory.
  • The memory may hold instructions or data that the processor 110 has just used or uses repeatedly. If the processor 110 needs the instructions or data again, it can fetch them directly from this memory, which avoids repeated access, reduces the waiting time of the processor 110, and thereby improves system efficiency.
  • processor 110 may include one or more interfaces.
  • The interface may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface.
  • the MIPI interface can be used to connect the processor 110 with peripheral devices such as the display screen 194 and the camera 193 .
  • The MIPI interface includes a camera serial interface (CSI), a display serial interface (DSI), and the like.
  • the processor 110 communicates with the camera 193 through a CSI interface to realize the shooting function of the terminal device 100 .
  • the processor 110 communicates with the display screen 194 through the DSI interface to realize the display function of the terminal device 100 .
  • the GPIO interface can be configured by software.
  • the GPIO interface can be configured as a control signal or as a data signal.
  • the GPIO interface can be used to connect the processor 110 with the camera 193 , the display screen 194 , the wireless communication module 160 , the audio module 170 , the sensor module 180 and so on.
  • the GPIO interface can also be configured as an I2C interface, I2S interface, UART interface, MIPI interface, etc.
  • the USB interface 130 is an interface conforming to the USB standard specification, specifically, it can be a Mini USB interface, a Micro USB interface, a USB Type C interface, and the like.
  • the USB interface 130 can be used to connect a charger to charge the terminal device 100, and can also be used to transmit data between the terminal device 100 and peripheral devices. It can also be used to connect headphones and play audio through them. This interface can also be used to connect other terminal devices, such as AR devices.
  • the interface connection relationship between the modules shown in the embodiment of the present application is only a schematic illustration, and does not constitute a structural limitation of the terminal device 100 .
  • the terminal device 100 may also adopt an interface connection mode different from those in the foregoing embodiments, or a combination of multiple interface connection modes.
  • the terminal device 100 implements a display function through a GPU, a display screen 194, an application processor, and the like.
  • the GPU is a microprocessor for image processing, and is connected to the display screen 194 and the application processor. GPUs are used to perform mathematical and geometric calculations for graphics rendering.
  • Processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
  • the display screen 194 is used to display images, videos and the like.
  • the display screen 194 includes a display panel.
  • the display panel can be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini LED, a Micro LED, a Micro OLED, a quantum dot light-emitting diode (QLED), etc.
  • the terminal device 100 may include 1 or N display screens 194, where N is a positive integer greater than 1.
  • the external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the terminal device 100.
  • the external memory card communicates with the processor 110 through the external memory interface 120 to implement a data storage function, for example, saving music, video, and other files on the external memory card.
  • the internal memory 121 may be used to store computer-executable program codes including instructions.
  • the internal memory 121 may include an area for storing programs and an area for storing data.
  • the program storage area can store an operating system, an application program required by at least one function (such as a sound playing function or an image playing function), and the like.
  • the storage data area can store data created during the use of the terminal device 100 (such as audio data, phonebook, etc.) and the like.
  • the internal memory 121 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, flash memory device, universal flash storage (universal flash storage, UFS) and the like.
  • the processor 110 executes various functional applications and data processing of the terminal device 100 by executing instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor.
  • the software system of the terminal device 100 may adopt a layered architecture, an event-driven architecture, a micro-kernel architecture, a micro-service architecture, or a cloud architecture.
  • an Android system with a layered architecture is taken as an example to illustrate the software structure of the terminal device 100 .
  • the layered architecture divides the software into several layers, and each layer has a clear role and division of labor. Layers communicate through software interfaces.
  • the Android system is divided into four layers, which are respectively the application program layer, the application program framework layer, the Android runtime (Android runtime) and the system library, and the kernel layer from top to bottom.
  • the application layer can consist of a series of application packages.
  • the application package can include applications such as camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, short message and multi-screen agent.
  • the application framework layer provides an application programming interface (application programming interface, API) and a programming framework for applications in the application layer.
  • the application framework layer includes some predefined functions.
  • the application framework layer can include window managers, content providers, view systems, phone managers, resource managers, notification managers and multi-screen frameworks, etc.
  • a window manager is used to manage window programs.
  • the window manager can get the size of the display screen, determine whether there is a status bar, lock the screen, capture the screen, etc.
  • Content providers are used to store and retrieve data and make it accessible to applications.
  • Said data may include video, images, audio, calls made and received, browsing history and bookmarks, phonebook, etc.
  • the view system includes visual controls, such as controls for displaying text, controls for displaying pictures, and so on.
  • the view system can be used to build applications.
  • a display interface can consist of one or more views.
  • a display interface including a text message notification icon may include a view for displaying text and a view for displaying pictures.
  • the phone manager is used to provide the communication function of the terminal device 100, for example, management of the call status (including connected, hung up, etc.).
  • the resource manager provides various resources for the application, such as localized strings, icons, pictures, layout files, video files, and so on.
  • the notification manager enables the application to display notification information in the status bar, which can be used to convey notification-type messages, and can automatically disappear after a short stay without user interaction.
  • the notification manager is used to notify the download completion, message reminder, etc.
  • the notification manager can also present a notification in the top status bar of the system in the form of a chart or scroll bar text, such as a notification of an application running in the background, or a notification that appears on the screen in the form of a dialog window. For example, a text message is displayed in the status bar, a prompt sound is issued, the terminal device vibrates, or the indicator light flashes.
  • the multi-screen framework is used to notify the "multi-screen agent" of the application layer of each event in which the terminal device 100 establishes a connection with the large-screen device, and can also be used, in response to instructions from the "multi-screen agent" of the application layer, to assist the "multi-screen agent" in obtaining data information.
  • the Android Runtime includes core library and virtual machine. The Android runtime is responsible for the scheduling and management of the Android system.
  • the core library consists of two parts: one part is the functions that the Java language needs to call, and the other part is the core library of Android.
  • the application layer and the application framework layer run in virtual machines.
  • the virtual machine executes the java files of the application program layer and the application program framework layer as binary files.
  • the virtual machine is used to perform functions such as object life cycle management, stack management, thread management, security and exception management, and garbage collection.
  • a system library can include multiple function modules. For example: surface manager (surface manager), media library (media libraries), 3D graphics processing library, 2D graphics engine, etc.
  • the surface manager is used to manage the display subsystem and provides the fusion of 2D and 3D layers for multiple applications.
  • the media library supports playback and recording of various commonly used audio and video formats, as well as still image files, etc.
  • the media library can support multiple audio and video encoding formats.
  • the 3D graphics processing library is used to implement 3D graphics drawing, image rendering, compositing, and layer processing, etc.
  • 2D graphics engine is a drawing engine for 2D drawing.
  • the kernel layer is the layer between hardware and software.
  • the kernel layer includes at least a display driver, a camera driver, an audio driver, and a sensor driver.
  • Fig. 4a is a modular flow chart of a method for implementing file deduplication provided by the embodiment of the present application.
  • FIG. 4a is described by taking the internal modularization process of the terminal device as an example. It can be understood that when the file deduplication method provided by the embodiment of the present application is applied to the cloud, or is applied to the interaction scenario between the terminal and the cloud, there is also a modular process similar to FIG. 4a.
  • the existing file access process in the terminal device includes: when the application initiates a file access request, the system directly writes the file in the file access request to the file cache in the VFS through a write operation (write), and then writes the file in the file access request to the file system.
  • the modular process for implementing the file deduplication method shown in FIG. 4a mainly includes a file operation module, a file cache module, an information processing module, a file index module, and a VFS.
  • the file cache module shown in Figure 4a is a new cache module in the memory space. It is used to intercept the write operation of the system and cache the file in the write operation. Together with the information processing module and the file index module, it calculates feature information for the cached files, judges from the feature information whether a file is a duplicate file, and deduplicates duplicate files online.
  • the non-duplicate files continue to be written into the VFS and then into the file system, block device layer, driver, and flash memory, which completes the file access process.
  • the file cache module shown in Figure 4a is mainly used to perform file comparison and file deduplication operations; the buffer operations in the file access process (such as setting flags, write checks, and space allocation) are still performed by the file cache in the VFS.
  • FIG. 4b is a modular flow chart of another method for implementing file deduplication provided by the embodiment of the present application.
  • FIG. 4b is described by taking the internal modularization process of the terminal device as an example. It can be understood that when the file deduplication method provided by the embodiment of the present application is applied to the cloud, or is applied to the interaction scenario between the terminal and the cloud, there is also a modular process similar to FIG. 4b.
  • the file cache module shown in Figure 4b enhances the original file cache, for example by adding functions such as calculating feature information for cached files, file comparison, and file deduplication.
  • buffer operations in the file access process are also performed by the file cache module shown in Figure 4b, but they are executed later than in the existing file access process. That is to say, the file cache in the VFS shown in FIG. 4b does not perform write operations (for example, it performs no buffer operations).
  • the file deduplication process provided by the embodiment of the present application can be embedded in the existing file access process and does not require an independent background thread, which helps reduce system write overhead.
  • the embodiment of the present application creates a new file caching module, which is used to realize deduplication of online files.
  • File operation module: used to intercept the file access request of the application, call the file cache module to cache data, call the information processing module to identify duplicate files, and work with the file cache module and the information processing module to remove duplicate files or save non-duplicate files.
  • File cache module: used to build an independent, self-built file cache space and cache intercepted files in it. For example, in the manner shown in Figure 4a, a new cache space is created in the existing memory space to cache the intercepted file data; or, in the manner shown in Figure 4b, the self-built file cache space replaces the file cache in the VFS and stores the intercepted file data.
  • Information processing module: used to obtain file data from the file cache module and calculate the feature information of the file, and also to send the file index module a request to retrieve feature information or a request to add feature information.
  • File index module: used to construct and maintain the index directory, and to retrieve target feature information in the index directory.
  • the index directory can be regarded as a kind of database, and the index directory does not occupy memory.
  • File directory: used to record the files stored in the file system.
  • the directory entries in the file directory include but are not limited to the file name, the link identifier of the file, the number of repetitions of the file, and the like.
  • File feature information: information used to uniquely identify each file.
  • the feature information of a file may include but is not limited to a fingerprint, a file ID, and so on.
  • for example, when file 1 and file 2 have different content, fingerprint 1 of file 1 and fingerprint 2 of file 2 are different; that is, fingerprint 1 identifies file 1 and fingerprint 2 identifies file 2.
  • when file 1 and file 2 have the same content (whether they have the same content and the same file name, or the same content but different file names), file 1 and file 2 have the same fingerprint (for example, both are fingerprint 1).
  • Index directory: a data access mode in which a directory is created in the system to serve as an index.
  • the index directory in this embodiment of the present application may be an index table of feature information.
  • the index directory is constructed and maintained by the file index module in an indexing manner based on the file directory.
  • the index directory includes one or more feature information indexes, for example, multiple fingerprint indexes.
  • each fingerprint index corresponds to a file in the index directory: the file name is the fingerprint, and the link identifier (inode) of that file is the inode of the file that the fingerprint identifies.
  • FIG. 5 is a schematic diagram of an index directory provided by the embodiment of the present application.
  • the system includes file A, file B and file C, the link identifier of file A is inode1, the link identifier of file B is inode2, and the link identifier of file C is inode3.
  • taking file A as an example, the feature information of file A is calculated first (that is, the fingerprint of file A is calculated) to generate fingerprint A1. Fingerprint A1 points to the link identifier inode1 of file A, so a fingerprint index is generated in the index directory: fingerprint A1-inode1.
  • the other fingerprint indexes in the index directory are generated in the same way: fingerprint B2-inode2, fingerprint C3-inode3, etc., as shown in FIG. 5.
  • the location of the file can be obtained directly through the link identifier when searching the index directory, which is beneficial to realize more efficient file search.
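  • as a minimal sketch of the index directory described above, the following Python example models each fingerprint index as a hard link whose name is the fingerprint, so that the entry shares the inode of the stored file. The helper names (`add_fingerprint`, `lookup_inode`, `demo`) and the use of a temporary directory are assumptions for illustration and are not taken from the embodiment; the sketch also assumes a file system that supports hard links.

```python
import os
import tempfile


def add_fingerprint(index_dir: str, fingerprint: str, file_path: str) -> None:
    """Insert a fingerprint index: a hard link named after the fingerprint."""
    os.link(file_path, os.path.join(index_dir, fingerprint))


def lookup_inode(index_dir: str, fingerprint: str):
    """Return the inode number recorded for a fingerprint, or None if absent."""
    try:
        return os.stat(os.path.join(index_dir, fingerprint)).st_ino
    except FileNotFoundError:
        return None


def demo():
    """Build 'fingerprint A1 -> inode of file A' and query a hit and a miss."""
    with tempfile.TemporaryDirectory() as root:
        index_dir = os.path.join(root, "index")
        os.mkdir(index_dir)
        file_a = os.path.join(root, "fileA")
        with open(file_a, "wb") as f:
            f.write(b"contents of file A")
        add_fingerprint(index_dir, "A1", file_a)
        hit = lookup_inode(index_dir, "A1") == os.stat(file_a).st_ino
        miss = lookup_inode(index_dir, "B2")
        return hit, miss
```

  • in this sketch a lookup is a single `stat` call on a path built from the fingerprint, which mirrors the idea that the link identifier is obtained directly when the index directory is searched.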
  • FIG. 6 is a schematic flow diagram of implementing a file deduplication function for an application program in a terminal device using an Android system according to an embodiment of the present application.
  • the terminal device can execute the file deduplication method during the file writing process.
  • the specific process is carried out through interaction among the file operation module, the information processing module, the file cache module, and the file index module, and includes the following steps:
  • the file operation module obtains the write request, and the write request includes the first file.
  • the file operation module calls the file cache module to store the first file in the first storage space.
  • when the file operation module detects a write request from the application, it can intercept the write request and cache the first file in the write request in the newly added file cache module (the first storage space). In the file cache module, operations such as calculating feature information, comparing duplicate files, and removing duplicate files are performed, as shown in Figure 7a. After the file cache module executes the file deduplication operation, it uses the standard write system call to cache the file in the write request to the VFS (the third storage space), and continues to perform the buffer operations in the VFS. The buffer operations in FIG. 7a refer to the write request operations that are not performed in the file cache module, including but not limited to setting flag bits, write check and space allocation, and data write-back.
  • the buffer operations in Figure 7a are the same as those in the existing write request. For example, a file is divided into multiple pages, and for each page the operations of setting flags, write check and space allocation, and data write-back are performed. After these buffer operations have been performed on all pages of the same file, the file is written to the disk and the system releases the memory occupied by the file.
  • the process shown in Figure 7a adopts a two-stage serial caching mode and embeds the interception caching, calculation, and deduplication functions into the existing caching process. Based on the feature information of a file, duplicate files are deduplicated: they are no longer written to the system and are discarded directly from memory, while non-duplicate files continue to be written to the system.
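  • the two-stage serial flow of Figure 7a can be sketched roughly as follows. This is an illustrative model only: the class name `WritePath`, the in-memory dictionaries, and the use of a full-content SHA-256 digest as the fingerprint are assumptions standing in for the modules and the sampled feature information of the embodiment.

```python
import hashlib


class WritePath:
    """Toy model: stage the file, fingerprint it, then dedup or pass it on."""

    def __init__(self):
        self.index = {}       # fingerprint -> inode (stand-in for the index directory)
        self.links = {}       # file name -> inode
        self.vfs_writes = []  # (inode, data) handed to the VFS cache (third storage space)
        self._next_inode = 1

    def write(self, name: str, data: bytes) -> str:
        staged = bytes(data)  # first stage: intercept and cache in the first storage space
        # Assumption: a full SHA-256 digest replaces the sampled fingerprint.
        fingerprint = hashlib.sha256(staged).hexdigest()
        inode = self.index.get(fingerprint)
        if inode is not None:
            # Duplicate: link the name to the existing inode and drop the staged copy.
            self.links[name] = inode
            return "deduplicated"
        inode = self._next_inode
        self._next_inode += 1
        self.index[fingerprint] = inode
        self.links[name] = inode
        # Second stage: non-duplicate data continues through the normal write path.
        self.vfs_writes.append((inode, staged))
        return "written"
```

  • writing the same content under two different names reaches the second stage only once in this model; the second name is simply linked to the inode of the first.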
  • when the file operation module detects a write request from the application, the file operation module defines a system call that invokes the caching function and first builds a self-built file cache (the first storage space); then, through the file cache module and based on the copy_from_user function, the intercepted first file is cached to the self-built file cache in one pass, as shown in Figure 7b.
  • one-time caching refers to caching all the pages of the same file to the self-built file cache instead of caching each page one by one.
  • buffer operations are deferred and simplified.
  • the buffer operation includes setting flags M times, writing check and space allocation once, and writing data back N times.
  • the characteristic information of the cached files can be calculated in the file caching module shown in FIG. 7b, so as to judge whether the cached files are duplicate files. If it is a duplicate file, discard the duplicate file from the memory; if it is a non-duplicate file, continue to write to the system.
  • operations such as buffer operation, feature information calculation, and duplicate file removal may be performed during the closing operation.
  • the close operation is a file operation performed after the write operation.
  • after the write operation (such as writing the file into the self-built file cache) completes, the system can perform the close operation.
  • performing the operations shown in Figure 7b, such as the buffer operations, feature information calculation, and duplicate file removal, during the close operation helps reduce the overhead of the system's write operation.
  • the information processing module determines the feature information of the first file through a sampling algorithm. Specifically, the information processing module uses a sampling hash algorithm to obtain sampled data of the first file, and determines the feature information of the first file according to that sampled data. The information processing module therefore only needs to sample a small amount of file data to obtain the feature information, which helps reduce system overhead.
  • the information processing module may also determine the characteristic information of the first file according to the sampling data of the first file and the file information of the first file.
  • the feature information may include but not limited to fingerprint information, file ID, etc.
  • the file information may include but not limited to file type, file size, etc. It can be understood that the characteristic information of the first file calculated and determined in combination with the sampled data of the first file and the file information of the first file can better reflect the uniqueness of the first file.
  • FIG. 8 is a schematic diagram of sampling and calculating characteristic information provided by an embodiment of the present application.
  • the first storage space can be regarded as data in a tree structure, and files are stored in pages.
  • the information processing module can obtain the sampling data of the file through the sampling hash algorithm.
  • partial data sampled from page1, page3, and page5 respectively constitute the head segment cyclic redundancy check (CRC), the middle segment CRC, and the tail segment CRC of the sampled data, as shown in FIG. 8.
  • the feature information determined from these segments is also called, for example, the fingerprint (FP) of the file.
  • the information processing module keeps the overhead of calculating characteristic information basically stable through sampling calculation, thereby reducing the impact of sampling and calculating characteristic information on the writing performance of the storage system.
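  • a minimal sketch of this kind of sampled fingerprint is given below. The page size, the number of bytes sampled per page, the choice of the first, middle, and last pages, and the way the three CRCs are combined with the file size are all assumptions made for illustration; the embodiment does not specify these values.

```python
import zlib

PAGE_SIZE = 4096    # assumed page size in bytes
SAMPLE_BYTES = 64   # assumed number of bytes sampled from each chosen page


def sampled_fingerprint(data: bytes) -> str:
    """Combine head, middle, and tail CRCs with the file size into a fingerprint."""
    pages = [data[i:i + PAGE_SIZE] for i in range(0, len(data), PAGE_SIZE)] or [b""]
    # For a five-page file this picks page1, page3, and page5.
    picks = (pages[0], pages[len(pages) // 2], pages[-1])
    crcs = [zlib.crc32(page[:SAMPLE_BYTES]) for page in picks]
    # The file size plays the role of the "file information" mixed into the fingerprint.
    return "{:08x}{:08x}{:08x}-{:x}".format(crcs[0], crcs[1], crcs[2], len(data))
```

  • the cost of this function depends on the three sampled slices rather than on the file length, which is the property the sampling step relies on. In this sketch, two files that differ only outside the sampled bytes and have the same size would receive the same fingerprint, so the sampling positions and the mixed-in file information determine how well the fingerprint reflects uniqueness.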
  • the information processing module judges, according to the first file, whether a second file identical to the first file exists in the second storage space.
  • the specific judging method includes: the information processing module determines the feature information of the first file and, according to that feature information, determines whether a third file exists in the index directory, where the file name of the third file is the same as the feature information of the first file and the third file is associated with the storage address of the second file in the second storage space.
  • when a second file whose feature information is the same as that of the first file exists in the second storage space, the second file is the same as the first file, and the first file is a duplicate file.
  • feature information is unique information. When the feature information of the first file is the same as that of the second file, it can be determined that the first file and the second file are the same file.
  • the file operation module associates the link identifier of the first file with the second file, where the link identifier of the first file is used to obtain the first file. That is to say, when the first file is a duplicate file, its link identifier is associated with the second file, so that a search for the first file returns the second file, which is identical to it. After the link identifier of the first file is associated with the second file, the same file (that is, the second file) can still be found through the link identifier of the first file even if the first file is deleted, which keeps the file's access path correct.
  • FIG. 9 is a schematic diagram of an operation process for duplicate files provided by the embodiment of the present application.
  • the left part in FIG. 9 is a file access list, which shows the files included in the write request and the link identifiers of the files.
  • the file access list includes two columns, the first column is the file name, and the second column is the link identifier (inode) of the file.
  • the link identifier of the file is used to obtain the file.
  • the right part of FIG. 9 shows some directory entries of the file directory (including the link identifier of the file and the number of times the file has been written repeatedly). It can be understood that the file directory is stored in the second storage space. For example, the write request includes file A, whose link identifier is inode1.
  • the terminal device stores file A in the first storage space and judges whether a second file identical to file A exists in the second storage space.
  • as a specific judgment method, for example, the information processing module judges, according to the feature information of file A, whether a second file whose feature information is the same as that of file A exists in the second storage space. If there is no such second file, file A is not a duplicate file.
  • a subsequent write request includes file D, whose link identifier is inode1.
  • the terminal device stores file D in the first storage space and judges whether a second file identical to file D exists in the second storage space.
  • as a specific judgment method, for example, the information processing module judges, according to the feature information of file D, whether a second file whose feature information is the same as that of file D exists in the second storage space. If the feature information of file A and file D is the same, file D is the same as file A, and file D is a duplicate file.
  • the file operation module associates the link identifier of file D with the link identifier of file A. For example, inode1 of file D points to the existing inode1. At this time, the number of write repetitions corresponding to inode1 is updated to 2, as shown in the second row and second column of the table on the right side of FIG. 9.
  • FIG. 10 shows a link correspondence after file deduplication.
  • the number of repetitions of inode1 is 2, which means that the same files are all linked to inode1.
  • the file system only needs to store the same file once.
  • the duplicate files will eventually be discarded from the memory, and no external storage write operations will be generated, so that low-overhead file deduplication can be completed in the file access path.
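  • the link bookkeeping of FIG. 9 and FIG. 10 can be illustrated with a small in-memory model. The class `FileDirectory` and its methods are hypothetical names; in particular, the `remove` behaviour (decrementing the repetition count and freeing the data only when the count reaches zero) is an assumption about how deletion could be handled and is not stated in the embodiment.

```python
class FileDirectory:
    """Toy file directory: file name -> inode, inode -> repetition count."""

    def __init__(self):
        self.name_to_inode = {}
        self.repeat_count = {}
        self.data = {}  # inode -> content, stored once per inode

    def store(self, name, inode, content):
        """Record a non-duplicate file; its content is stored once under the inode."""
        self.name_to_inode[name] = inode
        self.repeat_count[inode] = 1
        self.data[inode] = content

    def link_duplicate(self, name, inode):
        """Point a duplicate file at an existing inode and bump the repetition count."""
        self.name_to_inode[name] = inode
        self.repeat_count[inode] += 1

    def remove(self, name):
        """Assumed behaviour: drop the name; free the data only at count zero."""
        inode = self.name_to_inode.pop(name)
        self.repeat_count[inode] -= 1
        if self.repeat_count[inode] == 0:
            del self.repeat_count[inode]
            del self.data[inode]

    def read(self, name):
        """Resolve a file name through its inode to the single stored copy."""
        return self.data[self.name_to_inode[name]]
```

  • with file A stored under inode 1 and file D linked as a duplicate, the repetition count is 2 and both names read the same single copy; removing file A in this model leaves file D readable.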
  • the link correspondence shown in FIG. 10 still includes file D, so the deduplication is transparent to upper-layer applications.
  • the operations of the file index module on the index directory may include but not limited to creating fingerprints, inserting fingerprints, retrieving fingerprints, deleting fingerprints, and the like.
  • when creating the index directory, a file is created in the index directory according to the feature information of a file, and the name of the created file is the fingerprint.
  • for a non-duplicate file, a file is inserted into the index directory according to the feature information of the non-duplicate file, and the name of the inserted file is the fingerprint of the non-duplicate file.
  • the above steps may specifically be:
  • the file operation module judges, according to the application ID of the process, whether the current write request was sent by social software; if so, the file operation module intercepts the write request and calls the file cache module to create, in the kernel, a dedicated cache space (the first storage space) for the target file to cache its write data.
  • the information processing module reads the sampled data of the first file from the first storage space to determine the feature information of the first file, and searches the index directory for feature information of a second file that is the same as the feature information of the first file. If the same feature information is retrieved in the index directory, the first file is determined to be a duplicate file, and the file operation module executes the operation of removing duplicate files as shown in FIG. 9.
  • if no identical feature information is retrieved, the file operation module uses the first file in the first storage space to replace the cached data in the second storage space in the file system and sets the flag bit, so that the data of the first file can be synchronized back to the flash memory by the background thread of the file system.
  • Table 1 is a storage space comparison table provided by the embodiment of the present application. Table 1 compares the space occupied on a device without deduplication and on a device with deduplication after multiple operations. The multiple operations may include but are not limited to: using social software to send files (video, PPT, picture files, etc.) multiple times, using a browser to save files to system storage multiple times, and passing videos, PPTs, or pictures from one application to another multiple times (such as saving pictures from social software to the gallery, or calling files from the gallery in social software).
  • Table 1 Storage space comparison table
  • the operation process shown in FIG. 6 is the operation of the internal system of the terminal device, which is invisible to the user.
  • the terminal device can also display the effect of file deduplication to users through interface display or voice prompts.
  • the terminal device disables the file deduplication function by default, and the file deduplication function needs to be enabled after user authorization.
  • the specific implementation manner may be to obtain an instruction, which indicates to enable the file deduplication function; in response to the instruction, perform an operation of obtaining a write request.
  • the terminal device provides a switch button for the file deduplication function in related operations such as system settings, or prompts the user whether to enable the file deduplication function during the installation and upgrade of a new system. If the user decides to enable the file deduplication function, the user can turn on the switch button of the file deduplication function in the system settings; for the terminal device, the user's operation is converted into an instruction, which instructs to enable the file deduplication function. In response to this instruction, an operation of acquiring a write request is performed.
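  • the gating described above (the function is off by default, is enabled by a user instruction, and in the earlier example applies to write requests identified by application ID) can be sketched as follows. The class name `DedupSwitch`, the method names, and the application ID `com.example.social` are hypothetical and serve only to illustrate the control flow.

```python
# Hypothetical set of application IDs whose write requests are deduplicated.
SOCIAL_APP_IDS = frozenset({"com.example.social"})


class DedupSwitch:
    """Decides whether a write request should be intercepted for deduplication."""

    def __init__(self):
        self.enabled = False  # the function is disabled until the user authorizes it

    def handle_instruction(self, enable: bool) -> None:
        """Apply the instruction produced by the user's switch operation."""
        self.enabled = enable

    def should_intercept(self, app_id: str) -> bool:
        """Intercept only when the function is enabled and the application matches."""
        return self.enabled and app_id in SOCIAL_APP_IDS
```

  • before the user turns the switch on, `should_intercept` returns `False` for every application, so write requests follow the existing file access process unchanged.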
  • the terminal device may output a user prompt.
  • the user prompt may include but is not limited to: a prompt that the system can automatically deduplicate files in real time (or at regular intervals), transparently to applications, without user participation, and with extremely low overhead, thereby saving storage, as shown in Figure 11.
  • the terminal device can output user prompts through voice broadcast, and the broadcast system can automatically realize the file deduplication function in real time (or regularly) to the user.
  • the terminal device can generate prompt information, which may include but not limited to: prompts for deleted duplicate files, storage capacity released by deleting duplicate files, number of deleted duplicate files, duplicate file file type, etc.
  • the file deduplication prompt information is output in the interface where the user authorizes the file deduplication function to be enabled.
  • the file deduplication prompt information includes, but is not limited to, statistics presented cumulatively or by year, month, day, and so on; for example, a message that the system has automatically (with no user participation) optimized 20 GB of storage space and 1000 groups of files with identical content, in the video category, as shown in Figure 11.
  • the operation process shown in Figure 6 is the operation of the internal system of the terminal device.
  • the terminal device can also generate a record log.
  • the record log includes, but is not limited to: the data in the index directory, the storage location corresponding to the first file identifier, the data in the first storage space, the storage capacity released by deleting duplicate files, the number of deleted duplicate files, and the file types of the deleted duplicate files.
  • the terminal device can generate a record log of the file deduplication function.
  • the record log includes data in the index directory (for example, the feature information and file address of each of the one or more files in the index directory; the feature information values and file address values can be provided directly, without exposing the data structure of the index directory),
  • the specific amount of storage capacity released by deleting duplicate files (for example, 6 GB),
  • and the number of deleted duplicate files (for example, 1000 groups of deleted duplicate files).
  • the terminal device provides an API to the external device, so that the external device can call the file deduplication function through the API.
  • the terminal device provides a debugging API so that an external device can call the file deduplication function, for example by calling the file operation module and the information processing module through the API, so that the external device can execute the file deduplication function, as shown in Figure 12.
  • the external device in this implementation manner can be, for example, a server. When the server calls the file deduplication function through the API, automatic file deduplication can be realized on the server, and duplicate files can be effectively removed.
  • FIG. 13 is a schematic flow diagram of a file deduplication method provided in an embodiment of the present application.
  • the process of the file deduplication method is executed by a terminal device or a device deployed on the cloud, and includes the following steps:
  • the write request is used to request to write a file
  • the method of requesting to write a file may be that an application program initiates a file access request, for example, a write operation is performed through a control signal such as a pwrite function.
  • the first file included in the write request may be cached.
  • the first storage space and the second storage space are located at different layers of the storage system, which means that the first storage space and the second storage space are different in levels.
  • the first storage space is a memory space (such as a cache)
  • the second storage space is an external storage space (such as a disk). That is to say, during the file access process, the first file in the write request is held temporarily in the memory space and is not yet written to the external storage space, which helps reduce the overhead of writing to external storage. After it is determined whether the first file is a duplicate, a duplicate first file is deleted directly from the memory space, which implements online file deduplication.
  • the characteristic information of the file is determined by sampling part of the data of the file.
  • the terminal device determines feature information of the first file according to the sampling data of the first file. For a specific implementation manner, refer to a method for determining characteristic information by sampling data shown in FIG. 8 , which will not be repeated here.
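  • As a rough illustration of determining feature information from sampled data, the sketch below hashes a few evenly spaced chunks of a file together with its size. FIG. 8 is not reproduced in this section, so the sampling layout, the chunk size, the number of samples, the use of SHA-256, and the function name `sampled_fingerprint` are all assumptions made for illustration, not the scheme of the application.

```python
import hashlib


def sampled_fingerprint(data: bytes, chunk: int = 4096, samples: int = 4) -> str:
    """Hypothetical fingerprint: hash the file size plus a few sampled chunks."""
    h = hashlib.sha256()
    # Mix in the file size as simple "file information".
    h.update(len(data).to_bytes(8, "big"))
    if len(data) <= chunk * samples:
        # Small file: sampling would not save work, so hash everything.
        h.update(data)
    else:
        # Large file: read `samples` evenly spaced chunks, first to last.
        step = (len(data) - chunk) // (samples - 1)
        for i in range(samples):
            offset = i * step
            h.update(data[offset:offset + chunk])
    return h.hexdigest()
```

  • Because only part of the data is read, two different large files can produce the same fingerprint if they differ only outside the sampled chunks; a real system would have to choose the sampling so that this risk is acceptable.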
  • in the absence of the second file, the first file is stored in the third storage space, and a buffer operation is performed on the first file in the third storage space; after the buffer operation is completed, the first file is stored in the second storage space.
  • the first storage space refers to the cache space occupied by the file cache module
  • the third storage space refers to the file cache in the VFS.
  • the data structure of the first storage space is the same as the data structure of the third storage space.
  • the first storage space adopts a cache data structure, and operations of caching files can be performed in the first storage space;
  • the third storage space also adopts a cache data structure, and operations of caching files can also be performed in the third storage space.
  • the first storage space includes the cache space occupied by the file cache module and the file cache in the VFS.
  • there is only one data copy in the whole deduplication operation process; for the specific implementation, refer to the corresponding descriptions of FIG. 4b and FIG. 7b, which will not be repeated here.
  • after the buffer operation is performed, the first file is written from the memory space to the external storage space to complete the file access process.
  • the link identifier of the first file is associated with the second file, and the first file is deleted from the first storage space.
  • the link identifier of the first file is used to acquire the first file.
  • the file name of the third file is the same as the feature information of the first file
  • the third file is associated with the storage address of the second file in the second storage space.
  • the index directory is shown in FIG. 5 .
  • the feature information of the first file is calculated as fingerprint A1.
  • the fingerprint A1 exists in the index directory. This means that the feature information of the first file is the same as the file name of the third file, so it can be inferred that file A, which is associated with the third file, is identical to the first file; that is, the first file is a duplicate file.
  • the link identifier of the first file is associated with the second file.
  • for a specific implementation, refer to the manner of file association shown in FIG. 9, which will not be repeated here.
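  • One way to picture associating the link identifier of the first file with the second file is a file-system hard link, where the duplicate's name becomes another name for the data already on disk. FIG. 9 is not reproduced in this section, so treating the link identifier as a hard link is an assumption, and the helper name `link_duplicate` is hypothetical.

```python
import os


def link_duplicate(existing_path: str, link_path: str) -> None:
    """Hypothetical helper: expose already stored data under a second name.

    existing_path plays the role of the second file (already on disk);
    link_path plays the role of the link identifier of the first file.
    No file data is copied; both names refer to the same stored content.
    """
    os.link(existing_path, link_path)
```

  • After the call, opening `link_path` returns the content of the existing file, so an application that wrote the duplicate can still read it back by its own name.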
  • the first file is written into the file system according to a normal file access process.
  • a fourth file is newly created in the index directory
  • the file name of the fourth file is the characteristic information of the first file
  • the fourth file is associated with the storage address of the first file in the second storage space. That is to say, when the first file is not a duplicate file, a new fingerprint is inserted into the index directory, which lets the terminal device check later files against it.
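  • The index directory described above can be sketched as an ordinary directory in which each entry is a small file named after a fingerprint and containing the storage address of the corresponding stored file. Storing the address as text inside the entry, and the function name `lookup_or_register`, are assumptions made for illustration; FIG. 5 may lay the directory out differently.

```python
import os


def lookup_or_register(index_dir: str, fingerprint: str, stored_path: str):
    """Return the address of an existing duplicate, or register a new entry.

    A hit corresponds to finding the "third file"; a miss creates the
    "fourth file", whose name is the fingerprint of the new file.
    """
    entry = os.path.join(index_dir, fingerprint)
    if os.path.exists(entry):
        # Duplicate: the entry records where the identical file is stored.
        with open(entry, "r", encoding="utf-8") as f:
            return f.read()
    # Not a duplicate: record the new file's address under its fingerprint.
    with open(entry, "w", encoding="utf-8") as f:
        f.write(stored_path)
    return None
```

  • With this layout, a duplicate check is a single path lookup by name, which is one plausible reading of why the application describes the directory search as simple.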
  • the intercepted write request includes the fifth file
  • the file deduplication method further includes the following steps:
  • Prompt information is generated, and the prompt information includes one or more of the following: a prompt of deleted duplicate files, storage capacity released by deleting duplicate files, number of deleted duplicate files, and file types of duplicate files.
  • the file deduplication method further includes the following steps:
  • Generate a record log, which includes one or more of the following: data in the index directory, the storage location corresponding to the first file identifier, data in the first storage space, the storage capacity released by deleting duplicate files, the number of deleted duplicate files, and the file types of the deleted duplicate files.
  • the file deduplication method further includes the following steps:
  • the instruction instructs to enable the file deduplication function
  • the embodiment of the present application provides a file deduplication method.
  • the file deduplication method obtains a write request, stores the first file in the write request in the first storage space, and determines whether a second file identical to the first file exists in the second storage space.
  • the method can effectively remove duplicate files of the terminal device and reduce storage space occupation; it is insensitive to applications and does not require users to perform complicated operations, thereby reducing system processing overhead.
  • the same second file can also be queried through the link identifier of the first file, so that the access process of the file is not affected.
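  • The overall flow summarized above can be simulated in memory as follows. This is only a sketch under stated assumptions: plain dictionaries stand in for the first storage space, the second storage space, and the index directory; a full SHA-256 hash replaces the sampled feature information; and the class name `DedupStore` and its attributes are hypothetical.

```python
import hashlib


class DedupStore:
    """Toy model of online deduplication on the write path."""

    def __init__(self):
        self.staging = {}     # first storage space (memory): name -> bytes
        self.disk = {}        # second storage space: address -> bytes
        self.index = {}       # index directory: fingerprint -> address
        self.links = {}       # link identifier (file name) -> address
        self.saved_bytes = 0  # capacity not written thanks to deduplication

    def write(self, name: str, data: bytes) -> bool:
        """Handle one write request; return True if the file was a duplicate."""
        self.staging[name] = data                # stage the file in memory first
        fingerprint = hashlib.sha256(data).hexdigest()
        address = self.index.get(fingerprint)
        duplicate = address is not None
        if duplicate:
            self.saved_bytes += len(data)        # nothing is written to disk
        else:
            address = len(self.disk)             # next free address
            self.disk[address] = data            # commit to the second space
            self.index[fingerprint] = address    # register the new fingerprint
        self.links[name] = address               # the name resolves to stored data
        del self.staging[name]                   # drop the staged copy
        return duplicate

    def read(self, name: str) -> bytes:
        """Read a file through its link identifier."""
        return self.disk[self.links[name]]
```

  • In this model a second write of identical content never reaches `disk`, yet `read` still works for both names, which mirrors the claim that file access is unaffected after the duplicate is dropped.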
  • FIG. 14 is a schematic flow chart of a file search method provided in an embodiment of the present application.
  • the file search method can also be executed by a terminal device or a device deployed on the cloud, and includes the following steps:
  • the first file in this embodiment may be a file included in the write request.
  • the first file included in the write request is acquired.
  • the first file may also be a file already written in the file system.
  • one or more files in the file system are detected in the offline mode, and respective characteristic information of the one or more files are respectively determined.
  • the characteristic information of the first file is determined according to the sampling data of the first file.
  • the sampled data is partial data obtained from the data of the first file through a sampling algorithm.
  • for the way of determining the feature information of the first file and the method for obtaining the sampled data, refer to the embodiments in FIG. 6 and FIG. 8; details will not be repeated here. It can be understood that obtaining the feature information of the first file by sampling helps reduce data processing overhead.
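  • For the offline mode mentioned above, in which files already written to the file system are examined, a minimal sketch is to walk a directory tree and group files by fingerprint. The function name `find_duplicate_groups` is hypothetical, and a full SHA-256 of each file is used here instead of the sampled feature information, purely to keep the example short.

```python
import hashlib
import os
from collections import defaultdict


def find_duplicate_groups(root: str) -> list:
    """Group files under `root` whose contents have the same fingerprint."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                fingerprint = hashlib.sha256(f.read()).hexdigest()
            groups[fingerprint].append(path)
    # Keep only fingerprints shared by more than one file.
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]
```

  • Each returned group is a set of paths with identical content; an offline pass could keep one file per group and link the remaining names to it.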
  • the third file is a file in the index directory, and the third file is associated with the storage address of the second file in the second storage space, which means that the second file pointed to by the third file has already been written to disk and is a file that already exists in the system.
  • by searching the index directory, it can be determined whether a file identical to the first file already exists in the system.
  • when the third file does not exist in the index directory, the first file is stored in the second storage space, and a fourth file is added to the index directory; the file name of the fourth file is the feature information of the first file,
  • and the fourth file is associated with the storage address of the first file.
  • the feature information of the first file is calculated as fingerprint D4.
  • by searching the index directory shown in FIG. 5, it is determined that the fingerprint D4 does not exist in the index directory. This means that no file identical to the first file exists in the system, and the first file is a non-duplicate file.
  • a fourth file is inserted into the index directory as shown in FIG. 5 , the file name of the fourth file is fingerprint D4, and the fourth file points to the storage address of the first file in the second storage space.
  • the link identifier of the first file is associated with the second file, and the first file is deleted from the first storage space.
  • the link identifier of the first file is used to obtain the first file.
  • An embodiment of the present application provides a file search method.
  • the file search method acquires a first file and determines the feature information of the first file; according to the feature information of the first file, it determines whether a third file exists in the index directory, where
  • the file name of the third file is the same as the feature information of the first file.
  • searching through the index directory simplifies the process of looking up files. In addition, when the first file is a duplicate file and has been deleted, a later access to that file reaches the second file (identical to the first file) with which the link identifier of the first file is associated, so that normal file access is preserved.
  • the apparatus or device provided by the embodiments of the present application may include a hardware structure and/or a software module, and may implement the above functions in the form of a hardware structure, a software module, or a hardware structure plus a software module. Whether one of the above functions is executed in the form of a hardware structure, a software module, or a hardware structure plus a software module depends on the specific application and design constraints of the technical solution.
  • the division of modules in the embodiments of the present application is schematic, and is only a logical function division. There may be other division methods in actual implementation.
  • each functional module in each embodiment of the present application can be integrated into one processor, can exist physically on its own, or two or more modules can be integrated into one module.
  • the above-mentioned integrated modules can be implemented in the form of hardware or in the form of software function modules.
  • FIG. 15 is a device 1500 provided by an embodiment of the present application, which is used to implement the file deduplication function or file search function in the above method embodiments.
  • the device may be a terminal device or a device deployed on the cloud, or a device in the terminal device or the device deployed on the cloud, or a device that can be matched and used with the terminal device or the device deployed on the cloud.
  • the device may be a system on a chip.
  • the device 1500 includes at least one processor 1502, configured to implement the functions of the terminal device or the device deployed on the cloud in the file deduplication method or the file search method provided in the embodiment of the present application.
  • the processor 1502 may store the first file in the first storage space in response to the write request.
  • Device 1500 may also include at least one memory 1503 for storing program instructions and/or data.
  • the memory 1503 is coupled to the processor 1502 .
  • the coupling in the embodiments of the present application is an indirect coupling or a communication connection between devices, units or modules, which may be in electrical, mechanical or other forms, and is used for information exchange between devices, units or modules.
  • Processor 1502 may cooperate with memory 1503 .
  • Processor 1502 may execute program instructions stored in memory 1503 . At least one of the at least one memory may be included in the processor.
  • the device 1500 may further include a communication interface 1501, which may be, for example, a transceiver, an interface, a bus, a circuit, or a device capable of implementing a sending and receiving function.
  • the communication interface 1501 is used to communicate with other devices through a transmission medium, so that an apparatus in the device 1500 can communicate with other devices.
  • the other device may be a terminal.
  • the processor 1502 uses the communication interface 1501 to send and receive data, and is used to implement the method executed by the terminal device or the device deployed on the cloud described in the embodiment corresponding to FIG. 13 or FIG. 14 .
  • the embodiment of the present application does not limit the specific connection medium among the communication interface 1501, the processor 1502, and the memory 1503. In the embodiment of the present application, the memory 1503, the processor 1502, and the communication interface 1501 are connected by a bus in FIG. 15;
  • the bus is represented by a thick line in FIG. 15, and the connection mode between other components is only for schematic illustration and is not limiting.
  • the bus can be divided into address bus, data bus, control bus and so on. For ease of representation, only one thick line is used in FIG. 15 , but it does not mean that there is only one bus or one type of bus.
  • the processor may be a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or Execute the methods, steps and logic block diagrams disclosed in the embodiments of the present application.
  • a general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the methods disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules in the processor.
  • the memory may be a non-volatile memory, such as a hard disk drive (HDD) or a solid-state drive (SSD), and may also be a volatile memory, such as random-access memory (RAM).
  • the memory may also be, but is not limited to, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
  • the memory in the embodiment of the present application may also be a circuit or any other device capable of implementing a storage function, and is used for storing program instructions and/or data.
  • Figure 16 shows a file deduplication device 1600 provided by the embodiment of the present application.
  • the file deduplication device can be a terminal device or a device deployed on the cloud, an apparatus in a terminal device or in a device deployed on the cloud, or an apparatus that can be used together with a terminal device or a device deployed on the cloud.
  • the file deduplication device may include modules that correspond one-to-one to the methods/operations/steps/actions described in the example corresponding to Figure 13, and each module may be a hardware circuit, software, or a hardware circuit combined with software.
  • the device may include a file operation module 1601 , a file cache module 1602 , and an information processing module 1603 .
  • the file operation module 1601 is configured to obtain a write request, where the write request includes the first file.
  • the file caching module 1602 is configured to store the first file in response to the write request, and the first file is stored in the first storage space.
  • the information processing module 1603 is configured to determine whether there is a second file in the second storage space, the second file is the same as the first file, and the second storage space and the first storage space are located at different layers of the storage system.
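  • the file caching module 1602 is also used for:
  • when the second file does not exist, storing the first file in the third storage space and performing a buffer operation on the first file in the third storage space; after the buffer operation is completed, the first file is stored in the second storage space.
  • the file caching module 1602 is also used for:
  • when the second file does not exist, performing a buffer operation on the first file; after the buffer operation is completed, the first file is stored in the second storage space.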
  • the information processing module 1603 is further configured to associate the link identifier of the first file with the second file if the second file exists, and the link identifier of the first file is used to obtain the first file;
  • the file caching module 1602 is further configured to delete the first file from the first storage space.
  • the information processing module 1603 is also used to:
  • determine the feature information of the first file according to the sampled data of the first file; the sampled data is a portion of the first file's data obtained through a sampling algorithm.
  • the information processing module 1603 is also used to:
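  • determine the feature information of the first file; according to the feature information of the first file, determine whether a third file exists in the index directory, where the file name of the third file is the same as the feature information of the first file, and the third file is associated with the storage address of the second file in the second storage space.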
  • the file deduplication apparatus 1600 further includes a generation module 1604, and the generation module 1604 is used to generate prompt information, which includes one or more of the following: a notice that duplicate files have been deleted, the storage capacity released by deleting duplicate files, the number of deleted duplicate files, and the file types of the duplicate files.
  • the generation module 1604 is also used to generate a record log, which includes one or more of the following: data in the index directory, the storage location corresponding to the first file identifier, data in the first storage space, the storage capacity released by deleting duplicate files, the number of deleted duplicate files, and the file types of the deleted duplicate files.
  • the file deduplication apparatus 1600 further includes an execution module 1605, and the execution module 1605 is configured to obtain an instruction, the instruction indicates enabling the file deduplication function; in response to the instruction, perform an operation of obtaining a write request.
  • Figure 17 shows a file search device 1700 provided by the embodiment of the present application.
  • the file search device may be a terminal device or a device deployed on the cloud, or a device in a terminal device or a device deployed on the cloud. Or it is a device that can be matched with terminal devices or devices deployed on the cloud.
  • the file search device may include modules that correspond one-to-one to the methods/operations/steps/actions described in the example corresponding to Figure 14, and each module may be a hardware circuit, software, or a hardware circuit combined with software.
  • the device may include a file operation module 1701 and an information processing module 1702. Exemplarily, the file operation module 1701 is used to acquire the first file.
  • the information processing module 1702 is configured to determine feature information of the first file.
  • the information processing module 1702 is also used to determine whether there is a third file in the index directory according to the feature information of the first file.
  • the file name of the third file is the same as the feature information of the first file,
  • and the third file is associated with the storage address of the second file in the second storage space.
  • the information processing module 1702 is used to determine the feature information of the first file, including:
  • determining the feature information of the first file according to the sampled data of the first file; the sampled data is a portion of the first file's data obtained through a sampling algorithm.
  • the file search apparatus 1700 further includes a file cache module 1703, and the file cache module 1703 is configured to, when the third file does not exist in the index directory, store the first file in the second storage space and add a fourth file to the index directory;
  • the file name of the fourth file is the feature information of the first file, and the fourth file is associated with the storage address of the first file.
  • the information processing module 1702 is further configured to associate the link identifier of the first file with the second file when the third file exists in the index directory, and the link identifier of the first file is used to obtain the first file;
  • the file caching module 1703 is further configured to delete the first file from the first storage space.
  • the technical solutions provided by the embodiments of the present application may be fully or partially implemented by software, hardware, firmware or any combination thereof.
  • when implemented using software, the solutions may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions.
  • when the computer instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are generated in whole or in part.
  • the computer may be a general computer, a special computer, a computer network, a network device, a terminal device or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (such as infrared, wireless, or microwave).
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center integrated with one or more available media.
  • the available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital video disc (digital video disc, DVD)), or a semiconductor medium.
  • the various embodiments may refer to each other; for example, methods and/or terms may be cross-referenced between the method embodiments, functions and/or terms may be cross-referenced between the apparatus embodiments, and functions and/or terms may be cross-referenced between the apparatus embodiments and the method embodiments.


Abstract

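Embodiments of the present application provide a file deduplication method, apparatus, and device. During file writing, in response to a write request, the method temporarily stores the file in the write request in a first storage space and compares it with the files in a second storage space to determine whether the file in the write request is a duplicate. The method removes duplicate files automatically during file writing and reduces storage space occupation; the user does not need to initiate a file deduplication request, which reduces performance overhead.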

Description

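File deduplication method, apparatus, and device
Technical Field
The present application relates to the field of communication technologies, and in particular to a file deduplication method, apparatus, and device.
Background
The storage space of a terminal device is consumed quickly, and insufficient storage space is one of the key reasons users replace their devices. With the widespread use of the mobile Internet and smart terminals, more and more duplicate files are generated during social interaction and occupy a large amount of space. To reduce the storage occupied by duplicate files, some file deduplication applications already exist (for example, various phone cleaning tools). A phone cleaning tool provides a user entry point; after the user starts it manually, it scans for and identifies duplicate files on the terminal device and presents the scan results to the user, who then manually confirms and deletes the duplicate files one by one. However, this approach takes a long time to scan and requires the user to select and clear duplicate files one by one, which is time-consuming. In addition, because each file may correspond to an interaction window of a social application, directly deleting a duplicate file may cause the interaction window to display abnormally or make the conversation unavailable. Therefore, how to remove duplicate files effectively without the user or applications noticing is a problem to be solved.
Summary
Embodiments of the present application provide a file deduplication method, apparatus, and device. The method can automatically remove duplicate files and reduce storage space occupation; it is transparent to applications and does not require users to perform complex operations, which reduces system processing overhead.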
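In a first aspect, an embodiment of the present application provides a file deduplication method, implemented by a terminal device or a device deployed on the cloud. The terminal device or the device deployed on the cloud obtains a write request that includes a first file; in response to the write request, stores the first file in a first storage space; and determines whether a second file exists in a second storage space, where the second file is identical to the first file and the second storage space and the first storage space are located at different layers of the storage system. For example, the first storage space is in memory and the second storage space is in external storage (for example, a disk). In this method, when a write request is obtained, the first file it contains is stored in a separate storage space (the first storage space), and it is determined whether any existing file already stored in the second storage space is identical to the first file (that is, whether a duplicate exists). The method performs the duplicate check while obtaining the write request, which removes duplicate files online (also called online file deduplication) and is transparent to users and applications. Unlike the prior art, online file deduplication does not need to read files already written to external storage (for example, a disk) back into the cache before deduplicating them, so it reduces the number of repeated disk writes and avoids the disk write overhead caused by duplicate files. Moreover, once the user enables the file deduplication function, the duplicate check is performed each time a write request is received, so the user does not have to repeat manual deduplication, which improves user experience.
In one possible design, the file deduplication method of the first aspect can be applied to a scenario in which an application on a terminal device performs a write operation. The terminal device obtains a write request from the application, where the write request includes a first file; in response to the write request, stores the first file in the first storage space; and determines whether a second file exists in the second storage space, where the second file is identical to the first file and the second storage space and the first storage space are located at different layers of the storage system. With this method, the terminal device can perform online file deduplication during the application's write operation and reduce storage space occupation. For the terminal device, the deduplication process is transparent to applications, requires no cooperation from the ecosystem inside the terminal device and no complex user operations, and has low system overhead.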
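In one possible design, if the second file does not exist, the first file is stored in a third storage space and a buffer operation is performed on the first file in the third storage space; after the buffer operation is completed, the first file is stored in the second storage space. With this method, a first storage space is newly created in the existing file cache space (the third storage space) to perform the duplicate check, which implements online file deduplication.
In one possible design, if the second file does not exist, a buffer operation is performed on the first file in the first storage space; after the buffer operation is completed, the first file is stored in the second storage space. With this method, the first storage space is used to perform the duplicate check, which implements online file deduplication. The first storage space is compatible with the existing file cache space, which eases engineering implementation; and buffer operations such as setting flag bits, write checks, and space allocation are simplified and deferred, which helps reduce system operation overhead.
In one possible design, if the second file exists, the link identifier of the first file is associated with the second file, where the link identifier of the first file is used to obtain the first file, and the first file is then deleted from the first storage space. With this method, when a duplicate file exists, the system can delete it directly from the cache without an extra data copy, which helps reduce system overhead; and after the duplicate is deleted, its link identifier is associated with the file already stored in the system, so the file can still be found.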
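In one possible design, the second file being identical to the first file means that the feature information of the second file is the same as that of the first file. With this method, comparing feature information determines whether the file in the write request is a duplicate.
In one possible design, the feature information of the first file is determined from sampled data of the first file, where the sampled data is a portion of the first file's data obtained by a sampling algorithm. With this method, only a small amount of file data is sampled to obtain the feature information, which helps reduce system overhead.
In one possible design, the feature information of the first file is determined from the sampled data and the file information of the first file, where the file information includes the file type, the file size, and similar information. With this method, combining sampled data with file information reflects the file's feature information, and its uniqueness, more accurately.
In one possible design, the feature information includes fingerprint information and/or a file identifier (ID). The feature information of a file is unique: each file has its own unique feature information.
In one possible design, the feature information of the first file is determined in response to an instruction to close the first file. With this method, the feature information can be determined during the file close operation that follows the write operation, which helps reduce system overhead.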
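In one possible design, the feature information of the first file is determined; based on that feature information, the index directory is searched to determine whether a third file exists in it, where the file name of the third file is the same as the feature information of the first file and the third file is associated with the storage address of the second file. With this method, looking up the index directory provided by the embodiments of the present application determines whether the first file is a duplicate, which helps remove duplicate files more effectively.
In one possible design, if the third file does not exist in the index directory, a fourth file is added to the index directory, where the file name of the fourth file is the feature information of the first file and the fourth file is associated with the storage address of the first file. With this method, when the file in the write request is not a duplicate, the index directory is updated so that it covers the files already written to disk, which helps determine more accurately whether a duplicate exists in the system.
In one possible design, prompt information is generated, which includes one or more of the following: a notice that duplicate files have been deleted, the storage capacity released by deleting duplicate files, the number of deleted duplicate files, and the file types of the duplicate files. With this method, the effect of file deduplication can be shown explicitly to the user, which improves user experience.
In one possible design, a record log is generated, which includes one or more of the following: data in the index directory, the storage location corresponding to the first file identifier, data in the first storage space, the storage capacity released by deleting duplicate files, the number of deleted duplicate files, and the file types of the deleted duplicate files. With this method, a debugging application programming interface (API) or a debugging log can be provided externally, which helps users debug the system.
In one possible design, an instruction is obtained that indicates enabling the file deduplication function; in response to the instruction, the operation of obtaining a write request is performed. With this method, a switch for the file deduplication function can be provided to the user; the user only needs to turn on the switch to get automatic file deduplication and does not need to take part in the process, which improves user experience.
In one possible design, the whole process of the file deduplication method of the first aspect can be embedded in the main flow of the file access process. With this method, no separate file deduplication thread is needed; the process is embedded in an existing thread, which helps reduce overhead.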
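In a second aspect, an embodiment of the present application provides a file search method, implemented by a terminal device or a device deployed on the cloud. The terminal device or the device deployed on the cloud obtains a first file and determines its feature information; based on the feature information of the first file, it determines whether a third file exists in the index directory, where the file name of the third file is the same as the feature information of the first file, the third file is associated with the storage address of a second file, and the second file is stored in the second storage space. In this method, the index directory is built out of files: each third file in the index directory corresponds one-to-one to a second file stored in the second storage space, the feature information of the second file serves as the file name of the third file, and the third file is associated with the storage address of the second file, for example by storing that address in the third file. An index directory stored as files requires little storage space, which greatly reduces storage overhead, and it can be searched faster than in the prior art, which greatly improves system performance.
In one possible design, the feature information of the first file is determined from sampled data of the first file, where the sampled data is a portion of the first file's data obtained by a sampling algorithm. With this method, only a small amount of file data is sampled to obtain the feature information, which helps reduce system overhead.
In one possible design, if the third file does not exist in the index directory, the first file is stored in the second storage space and a fourth file is added to the index directory, where the file name of the fourth file is the feature information of the first file and the fourth file is associated with the storage address of the first file. With this method, when the file in the write request is not a duplicate, the index directory is updated so that it covers the files already written to disk, which helps determine more accurately whether a duplicate exists in the system.
In one possible design, if the third file exists in the index directory, the link identifier of the first file is associated with the storage address of the second file, where the link identifier of the first file is used to obtain the first file. With this method, when the first file is a duplicate and has been deleted, a later access to that file reaches the storage address of the second file associated with the first file's link identifier, so normal file access is preserved.
In one possible design, prompt information is generated, which includes one or more of the following: a notice that duplicate files have been deleted, the storage capacity released by deleting duplicate files, the number of deleted duplicate files, and the file types of the duplicate files. With this method, the effect of file deduplication can be shown explicitly to the user, which improves user experience.
In one possible design, a record log is generated, which includes one or more of the following: data in the index directory, the storage location corresponding to the first file identifier, data in the first storage space, the storage capacity released by deleting duplicate files, the number of deleted duplicate files, and the file types of the deleted duplicate files. With this method, a debugging application programming interface (API) or a debugging log can be provided externally, which helps users debug the system.
In one possible design, an instruction is obtained that indicates enabling the file deduplication function; in response to the instruction, the operation of obtaining a write request is performed. With this method, a switch for the file deduplication function can be provided to the user; the user only needs to turn on the switch to get automatic file deduplication and does not need to take part in the process, which improves user experience.
In one possible design, the whole process of the file search method of the second aspect can be embedded in the main flow of the file access process. With this method, no separate file deduplication thread is needed; the process is embedded in an existing thread, which helps reduce overhead.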
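In a third aspect, an embodiment of the present application provides a file deduplication apparatus that includes a file operation module, a file cache module, and an information processing module. The file operation module is configured to obtain a write request that includes a first file; the file cache module is configured to store the first file in a first storage space in response to the write request; and the information processing module is configured to determine whether a second file exists in a second storage space, where the second file is identical to the first file and the second storage space and the first storage space are located at different layers of the storage system.
In one possible design, the file cache module is further configured to, if the second file does not exist, store the first file in a third storage space and perform a buffer operation on the first file in the third storage space; after the buffer operation is completed, the first file is stored in the second storage space.
In one possible design, the file cache module is further configured to, if the second file does not exist, perform a buffer operation on the first file in the second storage space; after the buffer operation is completed, the first file is stored in the second storage space.
In one possible design, the information processing module is further configured to, if the second file exists, associate the link identifier of the first file with the second file, where the link identifier of the first file is used to obtain the first file; the file cache module is further configured to delete the first file from the first storage space.
In one possible design, the feature information includes fingerprint information and/or a file ID. The feature information of a file is unique: each file has its own unique feature information.
In one possible design, the information processing module is further configured to determine the feature information of the first file from sampled data of the first file, where the sampled data is a portion of the first file's data obtained by a sampling algorithm.
In one possible design, the information processing module is further configured to determine the feature information of the first file and, based on that feature information, determine whether a third file exists in the index directory, where the file name of the third file is the same as the feature information of the first file and the third file is associated with the storage address of the second file in the second storage space.
In one possible design, the file deduplication apparatus further includes a prompt module configured to generate prompt information, which includes one or more of the following: a notice that duplicate files have been deleted, the storage capacity released by deleting duplicate files, the number of deleted duplicate files, and the file types of the duplicate files.
In one possible design, the file deduplication apparatus further includes a generation module configured to generate a record log, which includes one or more of the following: data in the index directory, the storage location corresponding to the first file identifier, data in the first storage space, the storage capacity released by deleting duplicate files, the number of deleted duplicate files, and the file types of the deleted duplicate files.
In one possible design, the file deduplication apparatus further includes an execution module configured to obtain an instruction that indicates enabling the file deduplication function and, in response to the instruction, perform the operation of obtaining a write request.
The modules for implementing the file deduplication method provided in the third aspect and any of its possible designs also achieve the beneficial effects of the file deduplication method provided in the first aspect.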
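In a fourth aspect, an embodiment of the present application provides a file search apparatus that includes a file operation module and an information processing module. The file operation module is configured to obtain a first file, and the information processing module is configured to determine the feature information of the first file; the file operation module is further configured to determine, based on the feature information of the first file, whether a third file exists in the index directory, where the file name of the third file is the same as the feature information of the first file and the third file is associated with the storage address of the second file in the second storage space.
In one possible design, the information processing module being configured to determine the feature information of the first file includes:
determining the feature information of the first file from sampled data of the first file, where the sampled data is a portion of the first file's data obtained by a sampling algorithm.
In one possible design, the file search apparatus further includes a file cache module configured to, if the third file does not exist in the index directory, store the first file in the second storage space and add a fourth file to the index directory, where the file name of the fourth file is the feature information of the first file and the fourth file is associated with the storage address of the first file.
In one possible design, the information processing module is further configured to, if the third file exists in the index directory, associate the link identifier of the first file with the second file, where the link identifier of the first file is used to obtain the first file; the file cache module is further configured to delete the first file from the first storage space.
The modules for implementing the file search method provided in the fourth aspect and any of its possible designs also achieve the beneficial effects of the file search method provided in the second aspect.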
第五方面,本申请实施例提供一种设备,该设备可以是终端设备或者部署在云上的设备。其中,该设备包括一个或多个处理器和存储器;存储器与一个或多个处理器耦合,存储器存储有计算机程序,一个或多个处理器执行计算机程序时,该设备执行如下操作:
获取写请求,写请求中包括第一文件;
响应于写请求,存储第一文件,第一文件存储于第一存储空间;
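determining whether a second file exists in a second storage space, where the second file is identical to the first file, and the first storage space and the second storage space are located at different layers of the storage system.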
关于第一存储空间、第二存储空间、第一文件的抽样数据、第一文件的特征信息、第一文件的链接标识与第二文件相关联、生成提示信息、生成记录日志等的介绍请参见第一方面中对应的描述,此处不再赘述。
第六方面,本申请实施例提供一种设备,该设备可以是终端设备或者部署在云上的设备。其中,该设备包括一个或多个处理器和存储器;存储器与一个或多个处理器耦合,存储器存储有计算机程序,一个或多个处理器执行计算机程序时,该设备执行如下操作:
获取第一文件,并确定第一文件的特征信息;
根据第一文件的特征信息,确定索引目录中是否存在第三文件,第三文件的文件名与第一文件的特征信息相同,第三文件与第二文件在第二存储空间的存储地址相关联。
关于第一文件的特征信息、第三文件、第一文件的抽样数据、第一文件的链接标识与第二文件相关联等的介绍请参见第二方面中对应的描述,此处不再赘述。
第七方面,本申请实施例提供一种计算机可读存储介质,上述计算机可读存储介质存储有计算机程序,上述计算机程序被处理器执行以实现上述第一方面或第二方面及其可能实现的方式中的任一项所述的方法。
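In an eighth aspect, an embodiment of this application provides a chip system. The chip system includes a processor and may further include a memory, and is configured to implement the functions of the terminal device or the cloud-deployed device in the method according to the first aspect or the second aspect. The chip system may consist of a chip, or may include a chip and other discrete components.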
第九方面,本申请实施例中提供一种计算机程序产品,包括指令,当所述指令在计算机上运行时,使得计算机执行第一方面或第二方面及其可能实现的方式中的任一项所述的方法。
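Brief Description of Drawings
FIG. 1a is a schematic flowchart of a user manually performing a file deduplication function;
FIG. 1b is a schematic diagram of a file anomaly after a user manually performs a file deduplication function;
FIG. 2 is a schematic diagram of the hardware structure of a terminal device according to an embodiment of this application;
FIG. 3 is a schematic diagram of the software structure of a terminal device according to an embodiment of this application;
FIG. 4a is a modular flowchart for implementing a file deduplication method according to an embodiment of this application;
FIG. 4b is another modular flowchart for implementing a file deduplication method according to an embodiment of this application;
FIG. 5 is a schematic diagram of an index directory according to an embodiment of this application;
FIG. 6 is a schematic flowchart of implementing an application-oriented file deduplication function in an Android system terminal according to an embodiment of this application;
FIG. 7a is a schematic diagram of a procedure for performing a write operation in the first storage space according to an embodiment of this application;
FIG. 7b is a schematic diagram of another procedure for performing a write operation in the first storage space according to an embodiment of this application;
FIG. 8 is a schematic diagram of determining feature information from sampled data according to an embodiment of this application;
FIG. 9 is a schematic diagram of associating the link identifier of a file with an identical file according to an embodiment of this application;
FIG. 10 is a schematic diagram of a link correspondence according to an embodiment of this application;
FIG. 11 is a schematic diagram of outputting a file access authorization interface according to an embodiment of this application;
FIG. 12 is a schematic diagram of an external device invoking the file deduplication function according to an embodiment of this application;
FIG. 13 is a schematic flowchart of a file deduplication method according to an embodiment of this application;
FIG. 14 is a schematic flowchart of a file lookup method according to an embodiment of this application;
FIG. 15 is a schematic diagram of a device according to an embodiment of this application;
FIG. 16 is a schematic diagram of a file deduplication apparatus according to an embodiment of this application;
FIG. 17 is a schematic diagram of a file lookup apparatus according to an embodiment of this application.
Detailed Description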
在本申请实施例中,“/”可以表示前后关联的对象是一种“或”的关系,例如,A/B可以表示A或B;“和/或”可以用于描述关联对象存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况,其中A,B可以是单数或者复数。为了便于描述本申请实施例的技术方案,在本申请实施例中,可以采用“第一”、“第二”等字样对功能相同或相似的技术特征进行区分。该“第一”、“第二”等字样并不对数量和执行次序进行限定,并且“第一”、“第二”等字样也并不限定一定不同。在本申请实施例中,“示例性的”或者“例如”等词用于表示例子、例证或说明,被描述为“示例性的”或者“例如”的任何实施例或设计方案不应被解释为比其它实施例或设计方案更优选或更具优势。使用“示例性的”或者“例如”等词旨在以具体方式呈现相关概念,便于理解。
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行描述。
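Storage space on terminal devices is consumed quickly, and insufficient storage is one of the key reasons users replace their devices. With the widespread use of the mobile Internet and smart terminals, social interaction produces more and more duplicate files, which occupy a large amount of space. For example, some survey data shows that, even though some users habitually clean up files, duplicate files occupy more than 2 gigabytes (GB) for over one quarter of users; among the surveyed users, the largest amount of duplicate files reached 16.49 GB, or even more.
To reduce the storage occupied by duplicate files, on the one hand, some file deduplication applications already exist (for example, various phone cleaning tools). A phone cleaning tool provides an entry point for the user; after the user starts it manually, it scans the terminal device, identifies duplicate files, and presents the scan result to the user. The user then confirms and deletes the duplicate files one by one. For example, FIG. 1a shows the flow of a user manually performing file deduplication. The display interface of the terminal device shows information such as the storage space currently occupied on the system, junk files, and duplicate files. The user can manually choose to clean up duplicate files, and the display interface then shows multiple duplicate files and their sources, as shown in FIG. 1a. However, scanning in this way takes a long time, and the user has to select and clear duplicate files one by one, which is time-consuming. Moreover, because each file may correspond to a chat window of a social application, deleting a duplicate file directly may cause the chat window to display abnormally or make the conversation unusable. For example, FIG. 1b shows a file anomaly after a user manually performs file deduplication: because the user deleted the duplicate file directly during cleanup, the chat window can no longer display the original picture when the user opens it again to look for the picture.
On the other hand, there are also solutions that implement file deduplication by providing an application programming interface (API). For example, the Apple File System (APFS) supports copy-on-write. If the user copies a file stored on APFS to another folder on the same APFS file system, APFS creates a new file marked as copy-on-write that points to all of the storage of the original file. However, in this solution APFS does not attempt to determine whether an existing file, or a file copied from an external source, matches any file already on the file system. In addition, the solution requires an API and corresponding changes across the application ecosystem, which greatly limits the scenarios in which it can be applied.
Therefore, how to remove duplicate files effectively without the user or the applications noticing is a problem to be solved.
To solve this problem, an embodiment of this application provides a file deduplication method. The method can effectively remove duplicate files and reduce storage occupation. When applied in a terminal device, it is transparent to the applications on the device and requires no complex user operations, which reduces the processing overhead of the system.
The file deduplication method provided in the embodiments of this application can be applied to a terminal device or to a device deployed on the cloud. Optionally, the method can also be applied to a scenario in which a terminal device controls the deduplication of files on the cloud. The following first describes an example terminal device used in the subsequent embodiments of this application.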
图2示出了终端设备100的结构示意图。终端设备100可以包括处理器110,外部存储器接口120,内部存储器121,通用串行总线(universal serial bus,USB)接口130,充电管理模块140,电源管理模块141,电池142,天线1,天线2,移动通信模块150,无线通信模块160,音频模块170,扬声器170A,受话器170B,麦克风170C,耳机接口170D,传感器模块180,按键190,马达191,指示器192,摄像头193,显示屏194,以及用户标识模块(subscriber identification module,SIM)卡接口195等。其中传感器模块180可以包括压力传感器180A,陀螺仪传感器180B,气压传感器180C,磁传感器180D,加速度传感器180E,距离传感器180F,接近光传感器180G,指纹传感器180H,温度传感器180J,触摸传感器180K,环境光传感器180L,骨传导传感器180M等。
可以理解的是,本申请实施例示意的结构并不构成对终端设备100的具体限定。在本申请另一些实施例中,终端设备100可以包括比图示更多或更少的部件,或者组合某些部件,或者拆分某些部件,或者不同的部件布置。图示的部件可以以硬件,软件或软件和硬件的组合实现。
处理器110可以包括一个或多个处理单元,例如:处理器110可以包括应用处理器(application processor,AP),调制解调处理器,图形处理器(graphics processing unit,GPU),图像信号处理器(image signal processor,ISP),控制器,视频编解码器,数字信号处理器(digital signal processor,DSP),基带处理器,和/或神经网络处理器(neural-network processing unit,NPU)等。其中,不同的处理单元可以是独立的器件,也可以集成在一个或多个处理器中。
控制器可以根据指令操作码和时序信号,产生操作控制信号,完成取指令和执行指令的控制。
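A memory may also be provided in the processor 110 to store instructions and data. In some embodiments, the memory in the processor 110 is a cache. It can hold instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs the instructions or data again, it can fetch them directly from this memory, which avoids repeated accesses, reduces the waiting time of the processor 110, and thus improves system efficiency.
In some embodiments, the processor 110 may include one or more interfaces. The interfaces may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface.
The MIPI interface can be used to connect the processor 110 to peripheral components such as the display 194 and the camera 193. MIPI interfaces include the camera serial interface (CSI) and the display serial interface (DSI). In some embodiments, the processor 110 communicates with the camera 193 over the CSI interface to implement the photographing function of the terminal device 100, and communicates with the display 194 over the DSI interface to implement the display function of the terminal device 100.
The GPIO interface can be configured by software, either as a control signal or as a data signal. In some embodiments, the GPIO interface can be used to connect the processor 110 to the camera 193, the display 194, the wireless communication module 160, the audio module 170, the sensor module 180, and the like. The GPIO interface can also be configured as an I2C interface, an I2S interface, a UART interface, a MIPI interface, and the like.
The USB interface 130 is an interface that complies with the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type-C interface, or the like. The USB interface 130 can be used to connect a charger to charge the terminal device 100, or to transfer data between the terminal device 100 and peripheral devices. It can also be used to connect a headset and play audio through the headset, or to connect other terminal devices such as AR devices.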
可以理解的是,本申请实施例示意的各模块间的接口连接关系,只是示意性说明,并不构成对终端设备100的结构限定。在本申请另一些实施例中,终端设备100也可以采用上述实施例中不同的接口连接方式,或多种接口连接方式的组合。
终端设备100通过GPU,显示屏194,以及应用处理器等实现显示功能。GPU为图像处理的微处理器,连接显示屏194和应用处理器。GPU用于执行数学和几何计算,用于图形渲染。处理器110可包括一个或多个GPU,其执行程序指令以生成或改变显示信息。
The display 194 is used to display images, videos, and the like. The display 194 includes a display panel. The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini LED, a Micro LED, a Micro OLED, a quantum dot light-emitting diode (QLED), or the like. In some embodiments, the terminal device 100 may include one or N displays 194, where N is a positive integer greater than 1.
外部存储器接口120可以用于连接外部存储卡,例如Micro SD卡,实现扩展终端设备100的存储能力。外部存储卡通过外部存储器接口120与处理器110通信,实现数据存储功能。例如将音乐,视频等文件保存在外部存储卡中。
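The internal memory 121 can be used to store computer-executable program code, which includes instructions. The internal memory 121 may include a program storage area and a data storage area. The program storage area can store the operating system and the applications required by at least one function (for example, a sound playback function or an image playback function). The data storage area can store data created while the terminal device 100 is in use (for example, audio data and the phone book). In addition, the internal memory 121 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or universal flash storage (UFS). The processor 110 performs the various functional applications and data processing of the terminal device 100 by running instructions stored in the internal memory 121 and/or instructions stored in the memory provided in the processor.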
基于图2所示本申请实施例的终端设备100的硬件结构示意图,下面介绍本申请实施例的终端设备100的软件结构框图,如图3所示。
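The software system of the terminal device 100 may use a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. The embodiments of this application use the layered Android system as an example to describe the software structure of the terminal device 100.
A layered architecture divides the software into several layers, each with a clear role and responsibility. The layers communicate through software interfaces. In some embodiments, the Android system is divided into four layers, from top to bottom: the application layer, the application framework layer, the Android runtime and system libraries, and the kernel layer.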
应用程序层可以包括一系列应用程序包。
如图3所示,应用程序包可以包括相机,图库,日历,通话,地图,导航,WLAN,蓝牙,音乐,短信息和多屏代理等应用程序。
应用程序框架层为应用程序层的应用程序提供应用编程接口(application programming interface,API)和编程框架。应用程序框架层包括一些预先定义的函数。
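As shown in FIG. 3, the application framework layer may include a window manager, a content provider, a view system, a telephony manager, a resource manager, a notification manager, a multi-screen framework, and the like.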
窗口管理器用于管理窗口程序。窗口管理器可以获取显示屏大小,判断是否有状态栏,锁定屏幕,截取屏幕等。
内容提供器用来存放和获取数据,并使这些数据可以被应用程序访问。所述数据可以包括视频,图像,音频,拨打和接听的电话,浏览历史和书签,电话簿等。
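The view system includes visual controls, such as controls that display text and controls that display pictures. The view system can be used to build applications. A display interface can consist of one or more views. For example, a display interface that includes a short-message notification icon may include a view that displays text and a view that displays a picture.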
电话管理器用于提供终端设备100的通信功能。例如通话状态的管理(包括接通,挂断等)。
资源管理器为应用程序提供各种资源,比如本地化字符串,图标,图片,布局文件,视频文件等等。
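The notification manager enables an application to display notification information in the status bar. It can be used to convey informational messages that disappear automatically after a short stay without user interaction, for example to announce that a download has completed or to deliver a message reminder. A notification may also appear in the status bar at the top of the system as a chart or scrolling text, such as a notification from an application running in the background, or appear on the screen as a dialog window. Examples include showing a text prompt in the status bar, playing a prompt tone, vibrating the terminal device, and flashing the indicator light.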
多屏框架用于将终端设备100与大屏设备建立连接的各个事件通知到应用程序层的“多屏代理”,还可以用于响应于应用程序层的“多屏代理”的指令辅助该“多屏代理”获取数据信息。
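The Android runtime includes the core libraries and a virtual machine. The Android runtime is responsible for scheduling and management of the Android system.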
核心库包含两部分:一部分是java语言需要调用的功能函数,另一部分是安卓的核心库。
应用程序层和应用程序框架层运行在虚拟机中。虚拟机将应用程序层和应用程序框架层的java文件执行为二进制文件。虚拟机用于执行对象生命周期的管理,堆栈管理,线程管理,安全和异常的管理,以及垃圾回收等功能。
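The system libraries may include multiple functional modules, for example a surface manager, media libraries, a three-dimensional graphics processing library, and a 2D graphics engine.
The surface manager manages the display subsystem and provides compositing of 2D and 3D layers for multiple applications.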
媒体库支持多种常用的音频,视频格式回放和录制,以及静态图像文件等。媒体库可以支持多种音视频编码格式。
三维图形处理库用于实现三维图形绘图,图像渲染,合成,和图层处理等。
2D图形引擎是2D绘图的绘图引擎。
内核层是硬件和软件之间的层。内核层至少包含显示驱动,摄像头驱动,音频驱动,传感器驱动。
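FIG. 4a is a modular flowchart for implementing a file deduplication method according to an embodiment of this application. FIG. 4a uses the modular flow inside a terminal device as an example. It can be understood that a modular flow similar to FIG. 4a also exists when the file deduplication method is applied on the cloud, or in a scenario in which a terminal interacts with the cloud. The existing file access flow in a terminal device is as follows: when an application initiates a file access request, the system uses a write operation (write) to write the file in the request directly into the file cache in the VFS, and then writes the file into the file system. The file may further be written to the driver and the flash memory (flash). In other words, the existing flow writes the file directly into the memory space and the external storage space through the write operation, and can neither detect duplicate files nor deduplicate files online. The modular flow shown in FIG. 4a mainly includes a file operation module, a file cache module, an information processing module, a file index module, and the VFS. Unlike the existing file access flow, the file cache module in FIG. 4a is a cache module newly created in the memory space. It intercepts the system's write operation and caches the file carried by the write operation; together with the information processing module and the file index module, it computes feature information for the cached file, determines from the feature information whether the file is a duplicate, and deduplicates duplicate files online. In the flow of FIG. 4a, after the file cache module, the information processing module, and the file index module have performed these operations, non-duplicate files continue to be written into the VFS and then into the file system, block device layer, driver, and flash memory, which completes the file access flow. The deduplication flow of FIG. 4a requires a new cache space to be added to the existing memory space for online file deduplication. Note that the file cache module in FIG. 4a mainly performs file comparison and file deduplication; the buffer operations in the file access flow (for example, setting flag bits, write check, and space allocation) are still performed by the file cache in the VFS.
FIG. 4b is another modular flowchart for implementing a file deduplication method according to an embodiment of this application. FIG. 4b also uses the modular flow inside a terminal device as an example. It can be understood that a modular flow similar to FIG. 4b also exists when the file deduplication method is applied on the cloud, or in a scenario in which a terminal interacts with the cloud. Unlike the existing file access flow, the file cache module in FIG. 4b enhances the original file cache, for example by adding the computation of feature information for cached files, file comparison, and file deduplication, so that files can be deduplicated online. The buffer operations in the file access flow (for example, setting flag bits, write check, and space allocation) are also performed by the file cache module in FIG. 4b, but later than in the existing file access flow. In other words, the file cache in the VFS shown in FIG. 4b no longer performs the write operation (for example, it no longer performs the buffer operations).
In summary, in the modular flow of FIG. 4a or FIG. 4b, the file deduplication flow provided in the embodiments of this application can be embedded in the existing file access flow and needs no separate background thread, which helps reduce the write overhead of the system. In addition, the embodiments of this application introduce a file cache module to implement online file deduplication.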
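For ease of understanding, the terms used in the embodiments of this application are introduced below.
1. File operation module: intercepts file access requests from applications, calls the file cache module to cache data, calls the information processing module to identify duplicate files, and works with the file cache module and the information processing module to remove duplicate files or save non-duplicate files.
2. File cache module: builds an independent, self-built file cache space and uses it to cache intercepted files. For example, in the manner shown in FIG. 4a, a new cache space is created in the existing memory space to cache and hold the intercepted file data; or, in the manner shown in FIG. 4b, the self-built file cache space replaces the file cache of the VFS and holds the intercepted file data.
3. Information processing module: obtains file data from the file cache module and computes the feature information of the file; it also sends the file index module requests to search for feature information or to add new feature information.
4. File index module: builds and maintains the index directory and searches the index directory for target feature information. The index directory can be regarded as a database-like structure, and it does not occupy memory.
5. File directory: records the files stored in the file system. Directory entries in the file directory include, but are not limited to, the file name, the link identifier of the file, and the repetition count of the file.
6. Feature information of a file: information that uniquely identifies each file. It may include, but is not limited to, a fingerprint and a file ID. For example, for two files (file 1 and file 2) whose contents differ, fingerprint 1 of file 1 differs from fingerprint 2 of file 2; that is, fingerprint 1 identifies file 1 and fingerprint 2 identifies file 2. Optionally, when file 1 and file 2 have the same content (including, but not limited to, the case in which the contents and the file names are both the same, and the case in which the contents are the same but the file names differ), file 1 and file 2 have the same fingerprint (for example, both have fingerprint 1).
7. Index directory: a data access mode in which a directory is created in the system to serve as an index. For example, the index directory in the embodiments of this application may be an index table of feature information. The index directory is built and maintained by the file index module using a file-directory-based indexing approach. It contains one or more feature information index entries, for example multiple fingerprint index entries. Each fingerprint index entry corresponds to one file in the index directory: the file name is the fingerprint, and the link identifier (inode) of that file is the inode of the file the fingerprint belongs to. For example, FIG. 5 is a schematic diagram of an index directory according to an embodiment of this application. The system contains file A, file B, and file C, whose link identifiers are inode1, inode2, and inode3 respectively. When the index directory is built, the feature information of file A is computed first (that is, the fingerprint of file A is computed), producing fingerprint A1, and fingerprint A1 points to inode1, the link identifier of file A. This yields one fingerprint index entry in the index directory: fingerprint A1-inode1. Similarly, the other fingerprint index entries are generated for file B, file C, and so on: fingerprint B2-inode2, fingerprint C3-inode3, and so on, as shown in FIG. 5. Because each file fingerprint is associated with the link identifier of the file, a search of the index directory can obtain the location of the file directly through the link identifier, which makes file lookup more efficient.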
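The index directory of item 7 can be illustrated with ordinary hard links. The following is a minimal sketch under stated assumptions, not the embodiment's implementation: the helper names `add_index_entry` and `lookup`, the temporary directory layout, and the fingerprint strings are all hypothetical, and giving the index entry the same inode as the stored file through `os.link` is only one possible reading of "the file name is the fingerprint and the inode is that of the fingerprinted file".

```python
import os
import tempfile


def add_index_entry(index_dir: str, fingerprint: str, stored_path: str) -> str:
    """Create an index file named after the fingerprint (hypothetical helper)."""
    entry = os.path.join(index_dir, fingerprint)
    os.link(stored_path, entry)  # hard link: the entry shares stored_path's inode
    return entry


def lookup(index_dir: str, fingerprint: str):
    """Return the inode number recorded for a fingerprint, or None if absent."""
    try:
        return os.stat(os.path.join(index_dir, fingerprint)).st_ino
    except FileNotFoundError:
        return None


# Build a tiny index for one stored file, mirroring "fingerprint A1 - inode1".
root = tempfile.mkdtemp()
index_dir = os.path.join(root, "index")
os.mkdir(index_dir)
file_a = os.path.join(root, "fileA")
with open(file_a, "wb") as f:
    f.write(b"content of file A")

add_index_entry(index_dir, "fingerprintA1", file_a)
found = lookup(index_dir, "fingerprintA1")    # inode of file A
missing = lookup(index_dir, "fingerprintD4")  # no such fingerprint yet
```

Because the lookup is a single name resolution inside one directory, the sketch needs no in-memory table, which is in line with the statement that the index directory does not occupy memory.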
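With reference to FIG. 4a and FIG. 4b, the following uses the Android system as an example to describe in detail an embodiment in which the file deduplication method is applied to an Android terminal device.
FIG. 6 is a schematic flowchart of implementing an application-oriented file deduplication function in a terminal device running the Android system according to an embodiment of this application. In this scenario, when an application on the terminal device requests to write a file, the terminal device can perform the file deduplication method during the write. The flow is implemented through interaction among the file operation module, the information processing module, the file cache module, and the file index module provided in the embodiments of this application, and includes the following steps:
1. When an application requests to write a file, the file operation module obtains a write request that includes a first file. The file operation module calls the file cache module to store the first file in the first storage space.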
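In one implementation, in the modular flow of FIG. 4a, when the file operation module detects a write request from an application, it can intercept the write request and cache the first file in the request in the newly added file cache module (the first storage space). Computing feature information, comparing against duplicate files, and removing duplicate files are performed in the file cache module, as shown in FIG. 7a. After the file cache module has finished deduplication, the standard write system call is used to cache the file in the write request in the VFS (the third storage space), and the buffer operations continue in the VFS. The buffer operations in FIG. 7a are the write request operations that were not performed in the file cache module, including, but not limited to, setting flag bits, write check and space allocation, and data write-back. They are the same as the buffer operations in an existing write request: for example, a file is divided into multiple pages, and setting flag bits, write check and space allocation, and data write-back are performed for each page. After these buffer operations have been performed on all pages of a file, the file is written to disk and the system releases the memory the file occupied. As can be seen, the flow in FIG. 7a uses two caches in series and embeds interception caching, computation, and deduplication into the existing cache path. Based on the feature information of the file, duplicate files are deduplicated: they are no longer written further into the system and are discarded directly from memory, while non-duplicate files continue to be written into the system.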
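In one implementation, in the modular flow of FIG. 4b, when the file operation module detects a write request from an application, the file operation module uses a custom system call, the caching function, which first builds the self-built file cache (the first storage space). Using the copy_from_user function, the file cache module caches the intercepted first file in the self-built file cache in one pass, as shown in FIG. 7b. Caching in one pass means that all pages of the same file are cached in the self-built file cache together, rather than page by page. In this implementation, the buffer operations are deferred and simplified. For example, for M pages, the buffer operations consist of setting flag bits M times, performing write check and space allocation once, and writing data back N times. In the file cache module of FIG. 7b, the feature information of the cached file can be computed to determine whether the cached file is a duplicate. If it is a duplicate, it is discarded from memory; if it is not, it continues to be written into the system. As can be seen, the flow in FIG. 7b builds an independent self-built file cache to hold the file data, computes the file feature information and writes down into the cache in a single pass, so that the whole deduplication operation involves only one data copy. The buffer operations are also optimized and deferred. Duplicate data is eventually discarded from memory and causes no write to external storage, so low-overhead file deduplication is completed on the file access path.
Optionally, in the implementation of FIG. 7b, the buffer operations, the computation of feature information, and the removal of duplicate files can be performed during the close operation. The close operation is a file operation performed after the write operation: after the write operation (for example, writing the file into the self-built file cache) has completed, the system can perform the close operation and, while doing so, carry out the buffer operations, feature information computation, and duplicate removal shown in FIG. 7b, which helps reduce the overhead of the system's write operation.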
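2. The information processing module determines the feature information of the first file using a sampling algorithm. Specifically, the information processing module uses a sampled-hash algorithm to obtain sampled data of the first file and determines the feature information of the first file from the sampled data. Because only a small amount of file data needs to be sampled to obtain the feature information, system overhead is reduced. Optionally, the information processing module can also determine the feature information of the first file from both the sampled data and the file information of the first file. The feature information may include, but is not limited to, fingerprint information and a file ID; the file information may include, but is not limited to, the file type and the file size. It can be understood that feature information computed from both the sampled data and the file information better reflects the uniqueness of the first file.
For example, FIG. 8 is a schematic diagram of computing feature information by sampling according to an embodiment of this application. The first storage space can be regarded as tree-structured data in which files are stored in pages. The information processing module can obtain the sampled data of a file using the sampled-hash algorithm, for example by sampling part of the data of page1, page3, and page5 to form the head cyclic redundancy check (CRC), middle CRC, and tail CRC of the sampled data, as shown in FIG. 8. These are then combined with the file information (for example, the file type and file size) to determine the feature information, which is also called the fingerprint (FP) of the file. Computing the feature information by sampling keeps the computation cost essentially constant, which reduces the impact of feature information computation on the write performance of the storage system.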
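3. Based on the first file, the information processing module determines whether a second file identical to the first file exists in the second storage space. In one implementation, the determination is made as follows: the information processing module determines the feature information of the first file and, based on it, determines whether a third file exists in the index directory, where the file name of the third file is the same as the feature information of the first file, and the third file is associated with the storage address of the second file in the second storage space. If the second storage space contains a second file whose feature information is the same as that of the first file, the second file is identical to the first file, and the first file is a duplicate. Note that feature information is unique: when the feature information of the first file is the same as that of the second file, the two files can be determined to be the same file.
4. When the second file exists, the file operation module associates the link identifier of the first file with the second file, where the link identifier of the first file is used to obtain the first file. In other words, when the first file is a duplicate, its link identifier is associated with the second file, so that a lookup of the first file obtains the second file, which is identical to it. Once the link identifier of the first file has been associated with the second file, the identical file (that is, the second file) can still be found through the link identifier of the first file even if the first file is deleted, which keeps the file access path correct.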
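For example, FIG. 9 is a schematic diagram of the handling of a duplicate file according to an embodiment of this application. The left part of FIG. 9 is a file access list that shows the files included in write requests and their link identifiers. The list has two columns: the first is the file name and the second is the link identifier (inode) of the file, which is used to obtain the file. The right part of FIG. 9 shows some entries of the file directory (including the link identifier of each file and the number of times the file has been written). It can be understood that the file directory is stored in the second storage space. For example, a write request includes file A, whose link identifier is inode1. The terminal device stores file A in the first storage space and determines whether a second file identical to file A exists in the second storage space. Specifically, for example, the information processing module uses the feature information of file A to determine whether the second storage space contains a second file whose feature information is the same as that of file A. If no such second file exists, file A is not a duplicate and is written into the file directory. Because file A is written for the first time, its write repetition count is 1. A subsequent write request includes file D, whose link identifier is inode1. The terminal device stores file D in the first storage space and determines whether a second file identical to file D exists in the second storage space. Specifically, for example, the information processing module uses the feature information of file D to determine whether the second storage space contains a second file whose feature information is the same as that of file D. If file A and file D have the same feature information, file D is identical to file A and is a duplicate. In this case, the file operation module associates the link identifier of file D with the link identifier of file A; for example, inode1 of file D points to the already stored inode1, and the write repetition count of inode1 is updated to 2, as shown in the second row, second column of the table on the right of FIG. 9.
With this method, the actual write does not need to be repeated. The link identifier of the duplicate file only needs to be associated, by means of a hard link, with the identical file that is already stored, so that later accesses obtain the stored file through the link identifier. For example, FIG. 10 shows a link correspondence after file deduplication. The repetition count of inode1 is 2, which means that identical files are all linked to inode1, and the file system needs to store the identical content only once. In this case, the duplicate file is eventually discarded from memory and causes no write to external storage, so low-overhead file deduplication is completed on the file access path. In addition, the link correspondence in FIG. 10 still contains file D, so the deduplication is transparent to upper-layer applications. As can be seen, the system performs no additional data copy and does not compete with other processes for computing resources, which helps reduce file write overhead. Moreover, deduplication is completed on the input/output (I/O) path and needs no background thread or offline service.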
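The hard-link association and the repetition count of FIG. 9 and FIG. 10 can be illustrated with the standard `os.link` call. This is a minimal sketch, assuming a user-space setting rather than the kernel I/O path of the embodiment; `write_with_dedup`, the in-memory `index` dictionary, and the fingerprint value `"fp1"` are hypothetical names used only for illustration, and the fingerprint is taken as given.

```python
import os
import tempfile


def write_with_dedup(index: dict, path: str, data: bytes, fingerprint: str) -> bool:
    """Write data to path; return True if it was deduplicated through a hard link."""
    existing = index.get(fingerprint)
    if existing is not None:
        os.link(existing, path)   # new name, same inode: no file data is written
        return True
    with open(path, "wb") as f:   # non-duplicate: ordinary write
        f.write(data)
    index[fingerprint] = path     # remember where this content is stored
    return False


root = tempfile.mkdtemp()
index = {}
path_a = os.path.join(root, "fileA")
path_d = os.path.join(root, "fileD")

first = write_with_dedup(index, path_a, b"same bytes", "fp1")   # stored normally
second = write_with_dedup(index, path_d, b"same bytes", "fp1")  # linked to file A
link_count = os.stat(path_a).st_nlink  # plays the role of the repetition count
```

After the two writes, both names resolve to one inode, so file D remains visible to the application while the content is stored once. The sketch does not address what should happen when one of the linked files is later modified; that policy is outside what it illustrates.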
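In one implementation, the operations of the file index module on the index directory may include, but are not limited to, creating a fingerprint, inserting a fingerprint, searching for a fingerprint, and deleting a fingerprint. For example, when a new index directory is built, a file is created in the index directory based on the feature information of a file, and the file name is the fingerprint. As another example, for a non-duplicate file, a file is inserted into the index directory based on the feature information of the non-duplicate file, and the file name is the fingerprint of the non-duplicate file.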
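In one implementation, in the flow of FIG. 6, when an Android terminal device performs the file deduplication method for social applications, the foregoing steps may specifically be as follows:
1. In the Android kernel library, the code of the typical write operation is modified: the file operation module uses the application ID of the process to determine whether the current write request was issued by a social application. If it was, the file operation module intercepts the write request and calls the file cache module to create, in the kernel, a dedicated cache space (the first storage space) for the target file to cache its write data.
2. In the Android kernel library, the code of the typical close operation is modified: if the close request was issued by a social application, the information processing module uses the sampled data of the first file in the first storage space to determine the feature information of the first file, and searches the index directory for the feature information of a second file that is the same as the feature information of the first file. If the same feature information is found in the index directory, the first file is determined to be a duplicate, and the file operation module performs the deduplication operation shown in FIG. 9. If the same feature information is not found in the index directory, the first file is determined not to be a duplicate; the file operation module uses the first file in the first storage space to replace the cached data in the second storage space in the file system and sets the flag bits, so that the data of the first file can be synchronized back to the flash memory by the background thread of the file system.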
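The two modified operations above (intercept and cache on write, decide on close) can be simulated in user space. The sketch below is only an illustration under stated assumptions: the classes `Storage` and `CachedFile`, the function `handle_write_session`, and the application ID set `SOCIAL_APP_IDS` are hypothetical, the "storage" is a set of dictionaries rather than a file system, and the fingerprint is a CRC over the whole buffer instead of the sampled fingerprint of FIG. 8.

```python
import zlib

SOCIAL_APP_IDS = {1001}  # hypothetical IDs of applications whose writes are intercepted


class Storage:
    """Toy stand-in for the second storage space (external storage)."""

    def __init__(self):
        self.inodes = {}      # inode number -> file bytes
        self.names = {}       # file name -> inode number
        self.index = {}       # fingerprint -> inode number
        self.data_writes = 0  # number of real data writes

    def store_new(self, name: str, data: bytes) -> int:
        inode = len(self.inodes) + 1
        self.inodes[inode] = data
        self.names[name] = inode
        self.data_writes += 1
        return inode


class CachedFile:
    """Self-built cache (first storage space) for one file being written."""

    def __init__(self, storage: Storage, name: str):
        self.storage, self.name, self.buffer = storage, name, bytearray()

    def write(self, chunk: bytes) -> None:
        self.buffer += chunk  # cached only; nothing reaches storage yet

    def close(self) -> bool:
        """Run the deferred work; return True if the file was a duplicate."""
        data = bytes(self.buffer)
        fingerprint = f"{len(data)}-{zlib.crc32(data):08x}"
        self.buffer.clear()   # the cached copy is dropped in both cases
        inode = self.storage.index.get(fingerprint)
        if inode is not None:
            self.storage.names[self.name] = inode  # link only, no data write
            return True
        self.storage.index[fingerprint] = self.storage.store_new(self.name, data)
        return False


def handle_write_session(storage: Storage, app_id: int, name: str, chunks) -> str:
    """Return 'bypass', 'duplicate' or 'stored' for one write/close session."""
    if app_id not in SOCIAL_APP_IDS:
        storage.store_new(name, b"".join(chunks))  # ordinary path, not intercepted
        return "bypass"
    cached = CachedFile(storage, name)
    for chunk in chunks:
        cached.write(chunk)
    return "duplicate" if cached.close() else "stored"


storage = Storage()
outcomes = [
    handle_write_session(storage, 1001, "a.jpg", [b"header", b"pixels"]),
    handle_write_session(storage, 1001, "a_copy.jpg", [b"header", b"pixels"]),
    handle_write_session(storage, 2002, "notes.txt", [b"header", b"pixels"]),
]
```

In this run the second session adds a name but no data write, while the third session comes from an application outside the intercepted set and is written without any fingerprinting.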
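The following analyzes and compares the effect of using the file deduplication method provided in the embodiments of this application on a terminal device. Table 1 is a storage space comparison table according to an embodiment of this application. It compares the space occupied on a device without deduplication and on a device with deduplication after multiple operations. The operations may include, but are not limited to: sending files (video, PPT, picture files, and so on) multiple times with a social application, saving files to system storage multiple times with a browser, and passing videos, PPT files, or pictures from one application to another multiple times (for example, saving a picture from a social application to the gallery, or loading a file from the gallery into a social application).
Table 1: Storage space comparison
[Table 1 is provided as an image in the original publication: Figure PCTCN2021127162-appb-000001]
As can be seen, with the file deduplication method provided in the embodiments of this application, when an application repeats the same operation multiple times, the storage occupied on the terminal device does not grow with each repetition. This reduces storage occupation and has no impact on the application.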
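In an example, the flow shown in FIG. 6 is an operation inside the system of the terminal device and is invisible to the user. However, to improve the user experience and make the technical value visible, the terminal device can also show the effect of file deduplication to the user through the display interface, voice prompts, or similar means.
In one implementation, the file deduplication function is disabled by default on the terminal device and can be enabled only after the user authorizes it. Specifically, an instruction indicating that the file deduplication function is to be enabled is obtained, and the operation of obtaining write requests is performed in response to the instruction. For example, the terminal device provides a switch for the file deduplication function in the system settings or a similar place, or asks the user whether to enable the function during installation or upgrade of a new system. If the user decides to enable the function, the user can turn on the switch in the system settings. For the terminal device, this user action is converted into an instruction that indicates enabling the file deduplication function, and the operation of obtaining write requests is performed in response to that instruction.
In the implementation in which the file deduplication function is enabled, the terminal device can output a user prompt. For example, the prompt is output on the interface where the user authorizes the function or on the system upgrade prompt interface. The prompt may include, but is not limited to, a notice that the system can automatically deduplicate files in real time (or on a schedule), transparently to applications, without user involvement, and at very low overhead, thereby saving storage, as shown in FIG. 11. As another example, the terminal device can output the prompt by voice, announcing to the user that the system can automatically deduplicate files in real time (or on a schedule).
In the implementation in which the file deduplication function is enabled, the terminal device can generate prompt information. The prompt information may include, but is not limited to: a notice that duplicate files have been deleted, the storage capacity released by deleting duplicate files, the number of duplicate files deleted, and the file types of the duplicate files. For example, file deduplication prompt information is output on the interface where the user authorizes the function. It may include, but is not limited to, a notice that the system reports statistics cumulatively or by year, month, or day (transparent to applications, no user involvement), such as 20 GB of storage space optimized automatically, 1000 groups of identical files optimized, and the category being video, as shown in FIG. 11.
In an example, the flow shown in FIG. 6 is an operation inside the system of the terminal device. To facilitate system and application development, the terminal device can also generate a record log. The record log includes, but is not limited to: data in the index directory, the storage location corresponding to the first file identifier, data in the first storage space, the storage capacity released by deleting duplicate files, the number of duplicate files deleted, and the file types of the deleted duplicate files. For example, the terminal device can generate a record log of the file deduplication function. The record log includes the data in the index directory (for example, the feature information and file address of each of one or more files in the index directory; the feature information values and file address values can be provided directly, without exposing the data structure of the index directory), the exact storage capacity released by the deleted duplicate files (for example, 6 GB), and the number of deleted duplicate files (for example, 1000 groups).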
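One possible shape for such a record log is a flat JSON document that lists fingerprint and address values directly, without exposing how the index directory is organized. The sketch below is a hedged illustration only: the function `build_dedup_log` and every field name (`index`, `freed_bytes`, `removed_groups`, `file_types`) are assumptions made here and are not defined by the source.

```python
import json


def build_dedup_log(index_entries: dict, freed_bytes: int,
                    removed_groups: int, file_types: list) -> str:
    """Serialize deduplication statistics as one JSON record (hypothetical format)."""
    record = {
        # Plain fingerprint/address pairs; the index layout itself is not exposed.
        "index": [
            {"fingerprint": fp, "address": addr}
            for fp, addr in sorted(index_entries.items())
        ],
        "freed_bytes": freed_bytes,
        "removed_groups": removed_groups,
        "file_types": sorted(set(file_types)),
    }
    return json.dumps(record, sort_keys=True)


# Example values taken from the text: 6 GB released, 1000 groups removed.
log_line = build_dedup_log(
    {"fingerprintA1": "inode1"},
    freed_bytes=6 * 1024 ** 3,
    removed_groups=1000,
    file_types=["video", "video", "image"],
)
parsed = json.loads(log_line)
```

A line-oriented JSON record of this kind can be consumed by a debugging tool or returned through a debugging API without further parsing rules.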
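In one implementation, the terminal device provides an API to external devices so that an external device can invoke the file deduplication function through the API. For example, to facilitate system and application development and the debugging of the file deduplication function, the terminal device provides a debugging API through which an external device can invoke the function, for example by calling the file operation module and the information processing module, so that the external device can perform file deduplication, as shown in FIG. 12. It can be understood that when an external device invokes the functional modules that implement file deduplication through the API, the interaction among the file operation module, the information processing module, the file cache module, and the file index module is as described in the embodiment of FIG. 6 and is not repeated here. The external device in this implementation may be, for example, a server. When the server invokes the file deduplication function through the API, automatic file deduplication can be performed for the server, and duplicate files can be removed effectively.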
下面对本申请实施例提供的文件去重方法的具体流程进行详细的描述。
图13为本申请实施例提供的一种文件去重方法的流程示意图,该文件去重方法流程由终端设备或者部署在云上的设备所执行,包括以下步骤:
S101,获取写请求,写请求中包括第一文件。
其中,写请求用于请求写入文件,请求写入文件的方式可以是应用程序发起文件访问请求,例如,通过pwrite函数等控制信令执行写操作。
S102,响应于写请求,存储第一文件,第一文件存储于第一存储空间。
当拦截写请求后,可以缓存写请求中包括的第一文件,具体实现方式参考图4a或图4b中对应的描述,此处不再赘述。
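S103: Determine whether a second file exists in a second storage space, where the second file is identical to the first file, and the second storage space and the first storage space are located at different layers of the storage system.
That the first storage space and the second storage space are located at different layers of the storage system means that the two differ in level. For example, the first storage space is memory space (such as a cache), and the second storage space is external storage space (such as a disk). In other words, during file access the first file in the write request is held temporarily in memory and is not written to external storage, which reduces the overhead of writing to external storage. After it has been determined whether the first file is a duplicate, a duplicate is deleted directly from memory, which achieves online file deduplication.
In one implementation, to reduce the loss of write performance, the embodiments of this application determine the feature information of a file by sampling part of the file's data. The terminal device determines the feature information of the first file from the sampled data of the first file. For the specific implementation, refer to the method of determining feature information from sampled data shown in FIG. 8; details are not repeated here.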
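The sampling idea of FIG. 8 (head, middle, and tail CRCs combined with file information) can be sketched with the standard `zlib.crc32` function. The following is a minimal sketch under stated assumptions: the page size, the number of bytes sampled per page, the choice of first, middle, and last page, and the textual fingerprint format are all illustrative choices made here, not values given in the source.

```python
import zlib

PAGE_SIZE = 4096    # assumed page size in bytes
SAMPLE_BYTES = 256  # assumed number of bytes sampled from each chosen page


def sampled_fingerprint(data: bytes, file_type: str = "") -> str:
    """Fingerprint from head/middle/tail page samples plus file type and size."""
    size = len(data)
    page_count = max(1, (size + PAGE_SIZE - 1) // PAGE_SIZE)
    # First, middle and last page; the three coincide for very small files.
    chosen_pages = (0, page_count // 2, page_count - 1)
    crcs = []
    for page in chosen_pages:
        start = page * PAGE_SIZE
        crcs.append(zlib.crc32(data[start:start + SAMPLE_BYTES]))
    head, middle, tail = crcs
    return f"{file_type}-{size}-{head:08x}-{middle:08x}-{tail:08x}"


original = bytes(range(256)) * 64  # 16384 bytes, four pages
same_content = bytes(original)
longer = original + b"x"
changed_in_sample = bytearray(original)
changed_in_sample[10] ^= 0xFF      # byte 10 lies inside the head sample
```

The cost of the computation depends on `SAMPLE_BYTES` rather than on the file size, which matches the observation that sampling keeps the overhead roughly constant. A consequence of sampling in this sketch is that two files of the same type and size that differ only in unsampled bytes receive the same fingerprint, so a design built on it would have to decide how to treat such cases.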
一种实现方式中,在不存在第二文件的情况下,将第一文件存储于第三存储空间,并在第三存储空间内对第一文件执行缓存区操作;执行完缓存区操作后,将第一文件存储于第二存储空间。例如,如图4a所示的内存空间中,第一存储空间是指文件缓存模块占用的缓存空间,第三存储空间是指VFS中的文件缓存。其中,第一存储空间的数据结构与第三存储空间的数据结构相同。例如,第一存储空间采用缓存的数据结构,在第一存储空间中可以执行缓存文件的操作;第三存储空间也采用缓存的数据结构,在第三存储空间中也可以执行缓存文件的操作。该实现方式实现整个去重操作过程中有两次串行数据拷贝,具体实现方式参考图4a和图7a中对应的描述,此处不再赘述。在执行完缓存区操作后,将第一文件从内存空间写入外存空间,完成文件访问流程。
一种实现方式中,在不存在第二文件的情况下,在第一存储空间内对第一文件执行缓存区操作;执行完缓存区操作后,将第一文件存储于第二存储空间。例如,如图4b所示的内存空间中,第一存储空间包括文件缓存模块占用的缓存空间,以及VFS中的文件缓存。该实现方式实现整个去重操作过程中只有一次数据拷贝,具体实现方式参考图4b和图7b中对应的描述,此处不再赘述。在执行完缓存区操作后,将第一文件从内存空间写入外存空间,完成文件访问流程。
一种实现方式中,在存在第二文件的情况下,将第一文件的链接标识与第二文件相关联,并从第一存储空间中删除第一文件。其中,第一文件的链接标识用于获取第一文件。具体实现方式,参考图9中对应的描述,此处不再赘述。
一种实现方式中,当确定第一文件的特征信息后,根据第一文件的特征信息确定索引目录中是否存在第三文件,第三文件的文件名与第一文件的特征信息相同,第三文件与第二文件在第二存储空间中的存储地址相关联。其中,索引目录如图5所示。例如,计算第一文件的特征信息为指纹A1。通过查找如图5所示的索引目录,确定索引目录中存在指纹A1。则表示第一文件与第三文件的文件名相同,从而可以推导第三文件关联的文件A与第一文件为相同的文件,即第一文件为重复文件。其中,当索引目录中存在第三文件时,将第一文件的链接标识与第二文件相关联。具体实现方式,参考图9所示的一种文件关联的方式,此处不再赘述。
一种实现方式中,在索引目录中不存在第三文件的情况下,按照正常的文件访问流程将该第一文件写入文件***。
一种实现方式中,在索引目录中不存在第三文件的情况下,在索引目录中新建第四文件,第四文件的文件名为第一文件的特征信息,第四文件与第一文件在第二存储空间中的存储地址相关联。也就是说,当第一文件不为重复文件时,可以在索引目录中***新的指纹,从而有利于终端设备后续对其他文件的判断。例如,当再次拦截的写请求中包括第五文件时,判断索引目录中是否存在文件名与第五文件的特征信息相同。
一种实现方式中,该文件去重方法还包括以下步骤:
生成提示信息,提示信息包括以下一种或多种:已删除重复文件的提示、删除重复文件所释放的存储容量、删除重复文件的数量、重复文件的文件类型。具体实现方式,参考前文实施例中对生成提示信息的描述,此处不再赘述。
一种实现方式中,该文件去重方法还包括以下步骤:
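Generate a record log, where the record log includes one or more of the following: data in the index directory, the storage location corresponding to the first file identifier, data in the first storage space, the storage capacity released by deleting duplicate files, the number of duplicate files deleted, and the file types of the deleted duplicate files. For the specific implementation, refer to the description of outputting the record log in the foregoing embodiments; details are not repeated here.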
一种实现方式中,该文件去重方法还包括以下步骤:
获取指令,该指令指示开启文件去重功能;
响应于该指令,执行获取写请求的操作。
具体实现方式,参考图11中对输出文件访问授权界面的描述,此处不再赘述。
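An embodiment of this application provides a file deduplication method. The method obtains a write request, stores the first file in the write request in the first storage space, and determines whether a second file identical to the first file exists in the second storage space. The method can effectively remove duplicate files from a terminal device and reduce storage occupation; it is transparent to applications and requires no complex user operations, which reduces the processing overhead of the system. In addition, after the data of the first file has been deleted, the identical second file can still be found through the link identifier of the first file, so the file access flow is not affected.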
一种示例中,图14为本申请实施例提供的一种文件查找方法的流程示意图。该文件查找方法也可以由终端设备或者部署在云上的设备所执行,包括以下步骤:
S201,获取第一文件,并确定第一文件的特征信息。
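The first file in this embodiment may be a file included in a write request. For example, in online mode, when a write request is detected, the first file included in the write request is obtained. The first file may also be a file that has already been written into the file system. For example, in offline mode, one or more files in the file system are examined, and the feature information of each of them is determined.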
一种实现方式中,根据第一文件的抽样数据,确定第一文件的特征信息。其中,抽样数据是通过采样算法从第一文件的数据中获取的部分数据。具体实现方式,参考图6和图8实施例中对确定第一文件的特征信息以及对抽样数据获取方法的描述,此处不再赘述。可以理解,通过抽样的方式获取第一文件的特征信息,有利于降低数据处理的开销。
S202,根据第一文件的特征信息,确定索引目录中是否存在第三文件,第三文件的文件名与第一文件的特征信息相同。
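The third file is a file in the index directory and is associated with the storage address of the second file in the second storage space. This means that the second file to which the third file points has already been written to disk and already exists in the system. The index directory can therefore be used to find out whether a file identical to the first file already exists in the system.
In one implementation, when the third file does not exist in the index directory, the first file is stored in the second storage space, and a fourth file is added to the index directory, where the file name of the fourth file is the feature information of the first file, and the fourth file is associated with the storage address of the first file. For example, the feature information of the first file is computed to be fingerprint D4. A search of the index directory shown in FIG. 5 determines that fingerprint D4 does not exist in the index directory. This means that the system contains no file identical to the first file, and the first file is a non-duplicate file. A fourth file is inserted into the index directory shown in FIG. 5; its file name is fingerprint D4, and it points to the storage address of the first file in the second storage space.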
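The lookup of S201 and S202 in offline mode, including the registration of a new fingerprint when no match is found, can be sketched as follows. This is an illustration under stated assumptions: `fingerprint_of` and `find_or_register` are hypothetical helpers, the index is an in-memory dictionary standing in for the index directory, and the fingerprint is a CRC over the whole file rather than a sampled one.

```python
import os
import tempfile
import zlib


def fingerprint_of(path: str) -> str:
    """Compute a simple size-plus-CRC fingerprint for an existing file."""
    with open(path, "rb") as f:
        data = f.read()
    return f"{len(data)}-{zlib.crc32(data):08x}"


def find_or_register(index: dict, path: str) -> str:
    """Return the path already indexed under this fingerprint, adding `path` if none."""
    return index.setdefault(fingerprint_of(path), path)


# Three files already on disk; D has the same content as A.
root = tempfile.mkdtemp()
paths = []
for name, content in (("A", b"photo"), ("B", b"video"), ("D", b"photo")):
    path = os.path.join(root, name)
    with open(path, "wb") as f:
        f.write(content)
    paths.append(path)

index = {}
owners = [find_or_register(index, p) for p in paths]
```

A file is a non-duplicate exactly when `find_or_register` returns its own path; for file D the call returns the path of file A, which is the file its link identifier would then be associated with.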
一种实现方式中,在索引目录中存在第三文件的情况下,将第一文件的链接标识与第二文件相关联,并从第一存储空间中删除第一文件。其中,第一文件的链接标识用于获取第一文件。具体实现方式,参考图9和图10实施例中对应的描述,此处不再赘述。
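An embodiment of this application provides a file lookup method. The method obtains a first file and determines its feature information, and then determines, based on the feature information of the first file, whether a third file exists in the index directory, where the file name of the third file is the same as the feature information of the first file. Searching through the index directory simplifies the file lookup flow. In addition, when the first file is a duplicate and the duplicate has been deleted, an access to the corresponding file reaches the second file to which the feature information of the first file is linked (the file identical to the first file), so normal file access is preserved.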
为了实现本申请实施例提供的方法中的各功能,本申请实施例提供的装置或设备可以包括硬件结构和/或软件模块,以硬件结构、软件模块、或硬件结构加软件模块的形式来实现上述各功能。上述各功能中的某个功能以硬件结构、软件模块、还是硬件结构加软件模块的方式来执行,取决于技术方案的特定应用和设计约束条件。本申请实施例中对模块的划分是示意性的,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,另外,在本申请各个实施例中的各功能模块可以集成在一个处理器中,也可以是单独物理存在,也可以两个或两个以上模块集成在一个模块中。上述集成的模块既可以采用硬件的形式实现,也可以采用软件功能模块的形式实现。
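FIG. 15 shows a device 1500 according to an embodiment of this application, which implements the file deduplication function or the file lookup function in the foregoing method embodiments. The device may be a terminal device or a device deployed on the cloud, an apparatus in such a device, or an apparatus that can be used together with such a device. The device may be a chip system. The device 1500 includes at least one processor 1502, configured to implement the functions of the terminal device or the cloud-deployed device in the file deduplication method or the file lookup method provided in the embodiments of this application. For example, the processor 1502 may store the first file in the first storage space in response to a write request; for details, refer to the method examples. The device 1500 may further include at least one memory 1503, configured to store program instructions and/or data. The memory 1503 is coupled to the processor 1502. Coupling in the embodiments of this application is an indirect coupling or communication connection between apparatuses, units, or modules; it may be electrical, mechanical, or of another form, and is used for information exchange between the apparatuses, units, or modules. The processor 1502 may operate in cooperation with the memory 1503 and may execute the program instructions stored in the memory 1503. At least one of the at least one memory may be included in the processor. The device 1500 may further include a communication interface 1501, which may be, for example, a transceiver, an interface, a bus, a circuit, or an apparatus capable of transmitting and receiving. The communication interface 1501 is used to communicate with other devices through a transmission medium, so that the apparatus in the device 1500 can communicate with other devices. For example, the other device may be a terminal. The processor 1502 sends and receives data through the communication interface 1501 and implements the method performed by the terminal device or the cloud-deployed device in the embodiment corresponding to FIG. 13 or FIG. 14. The embodiments of this application do not limit the specific connection medium among the communication interface 1501, the processor 1502, and the memory 1503. In FIG. 15, the memory 1503, the processor 1502, and the communication interface 1501 are connected by a bus 1504, which is drawn as a thick line; the connections between other components are illustrative only and are not limiting. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in FIG. 15, but this does not mean that there is only one bus or one type of bus.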
在本申请实施例中,处理器可以是通用处理器、数字信号处理器、专用集成电路、现场可编程门阵列或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件,可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件处理器执行完成,或者用处理器中的硬件及软件模块组合执行完成。
在本申请实施例中,存储器可以是非易失性存储器,比如硬盘(hard disk drive,HDD)或固态硬盘(solid-state drive,SSD)等,还可以是易失性存储器(volatile memory),例如随机存取存储器(random-access memory,RAM)。存储器是能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其他介质,但不限于此。本申请实施例中的存储器还可以是电路或者其它任意能够实现存储功能的装置,用于存储程序指令和/或数据。
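FIG. 16 shows a file deduplication apparatus 1600 according to an embodiment of this application. The apparatus may be a terminal device or a device deployed on the cloud, an apparatus in such a device, or an apparatus that can be used together with such a device. In one design, the apparatus may include modules that correspond one to one with the methods, operations, steps, or actions described in the example corresponding to FIG. 13; each module may be a hardware circuit, software, or a hardware circuit combined with software. In one design, the apparatus may include a file operation module 1601, a file cache module 1602, and an information processing module 1603. For example, the file operation module 1601 is configured to obtain a write request that includes a first file. The file cache module 1602 is configured to store the first file in a first storage space in response to the write request. The information processing module 1603 is configured to determine whether a second file exists in a second storage space, where the second file is identical to the first file, and the second storage space and the first storage space are located at different layers of the storage system.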
示例性地,文件缓存模块1602还用于:
在不存在第二文件的情况下,将第一文件存储于第三存储空间,并在第三存储空间内对第一文件执行缓存区操作;
执行完缓存区操作后,将第一文件存储于第二存储空间。
示例性地,文件缓存模块1602还用于:
when the second file does not exist, perform the buffer operations on the first file in the first storage space;
执行完缓存区操作后,将第一文件存储于第二存储空间。
信息处理模块1603还用于在存在第二文件的情况下,将第一文件的链接标识与第二文件相关联,第一文件的链接标识用于获取第一文件;
文件缓存模块1602还用于从第一存储空间中删除第一文件。
示例性地,信息处理模块1603还用于:
根据第一文件的抽样数据,确定第一文件的特征信息;抽样数据是通过采样算法从第一文件的数据中获取的部分数据。
示例性地,信息处理模块1603还用于:
确定第一文件的特征信息;
根据第一文件的特征信息,确定索引目录中是否存在第三文件,第三文件的文件名与第一文件的特征信息相同,第三文件与第二文件在第二存储空间中的存储地址相关联。
示例性地,文件去重装置1600还包括生成模块1604,生成模块1604用于生成提示信息,提示信息包括以下一种或多种:已删除重复文件的提示、删除重复文件所释放的存储容量、删除重复文件的数量、重复文件的文件类型。
示例性地,生成模块1604还用于生成记录日志,记录日志包括以下一项或多项内容:索引目录中的数据、第一文件标识对应的存储位置、第一存储空间中的数据、删除重复文件所释放的存储容量、删除重复文件的数量、删除重复文件的文件类型。
示例性地,文件去重装置1600还包括执行模块1605,执行模块1605用于获取指令,该指令指示开启文件去重功能;响应于该指令,执行获取写请求的操作。
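FIG. 17 shows a file lookup apparatus 1700 according to an embodiment of this application. The apparatus may be a terminal device or a device deployed on the cloud, an apparatus in such a device, or an apparatus that can be used together with such a device. In one design, the apparatus may include modules that correspond one to one with the methods, operations, steps, or actions described in the example corresponding to FIG. 14; each module may be a hardware circuit, software, or a hardware circuit combined with software. In one design, the apparatus may include a file operation module 1701 and an information processing module 1702. For example, the file operation module 1701 is configured to obtain a first file. The information processing module 1702 is configured to determine the feature information of the first file. The information processing module 1702 is further configured to determine, based on the feature information of the first file, whether a third file exists in the index directory, where the file name of the third file is the same as the feature information of the first file, and the third file is associated with the storage address of a second file in a second storage space.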
示例性地,信息处理模块1702用于确定第一文件的特征信息,包括:
根据第一文件的抽样数据,确定第一文件的特征信息;抽样数据是通过采样算法从第一文件的数据中获取的部分数据。
示例性地,文件查找装置1700还包括文件缓存模块1703,文件缓存模块1703用于在索引目录中不存在第三文件的情况下,将第一文件存储于第二存储空间,并在索引目录中增加第四文件,第四文件的文件名为第一文件的特征信息,第四文件与第一文件的存储地址相关联。
示例性地,信息处理模块1702还用于在索引目录中存在第三文件的情况下,将第一文件的链接标识与第二文件相关联,第一文件的链接标识用于获取第一文件;
文件缓存模块1703还用于从第一存储空间中删除第一文件。
本申请实施例提供的技术方案可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、网络设备、终端设备或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(digital subscriber line,DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机可以存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质(例如,软盘、硬盘、磁带)、光介质(例如,数字视频光盘(digital video disc,DVD))、或者半导体介质等。在本申请实施例中,在无逻辑矛盾的前提下,各实施例之间可以相互引用,例如方法实施例之间的方法和/或术语可以相互引用,例如装置实施例之间的功能和/或术语可以相互引用,例如装置实施例和方法实施例之间的功能和/或术语可以相互引用。显然,本领域的技术人员可以对本申请进行各种改动和变型而不脱离本申请的范围。这样,倘若本申请的这些修改和变型属于本申请权利要求及其等同技术的范围之内,则本申请也意图包含这些改动和变型在内。

Claims (29)

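  1. A file deduplication method, comprising:
    obtaining a write request, wherein the write request comprises a first file;
    in response to the write request, storing the first file, wherein the first file is stored in a first storage space; and
    determining whether a second file exists in a second storage space, wherein the second file is identical to the first file, and the second storage space and the first storage space are located at different layers of a storage system.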
  2. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    在不存在所述第二文件的情况下,将所述第一文件存储于第三存储空间,并在所述第三存储空间内对所述第一文件执行缓存区操作;
    执行完所述缓存区操作后,将所述第一文件存储于所述第二存储空间。
  3. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    在不存在所述第二文件的情况下,在所述第一存储空间内对所述第一文件执行缓存区操作;
    执行完所述缓存区操作后,将所述第一文件存储于所述第二存储空间。
  4. 根据权利要求1至3任一项所述的方法,其特征在于,所述方法还包括:
    在存在第二文件的情况下,将所述第一文件的链接标识与所述第二文件相关联,所述第一文件的链接标识用于获取所述第一文件,从所述第一存储空间中删除所述第一文件。
  5. 根据权利要求1至3任一项所述的方法,其特征在于,所述第二文件与所述第一文件相同,包括:
    所述第二文件的特征信息与所述第一文件的特征信息相同。
  6. 根据权利要求5所述的方法,其特征在于,所述方法还包括:
    根据所述第一文件的抽样数据,确定所述第一文件的特征信息,所述抽样数据是通过采样算法从所述第一文件的数据中获取的部分数据。
  7. 根据权利要求4至6任一项所述的方法,其特征在于,所述确定第二存储空间中是否存在第二文件,包括:
    确定所述第一文件的特征信息;
    根据所述第一文件的特征信息,确定索引目录中是否存在第三文件,所述第三文件的文件名与所述第一文件的特征信息相同,所述第三文件与所述第二文件在所述第二存储空间中的存储地址相关联。
  8. 根据权利要求1至7任一项所述的方法,其特征在于,所述方法还包括:
    生成提示信息,所述提示信息包括以下一种或多种:已删除重复文件的提示、删除重复文件所释放的存储容量、删除重复文件的数量、重复文件的文件类型。
  9. 根据权利要求1至7任一项所述的方法,其特征在于,所述方法还包括:
    生成记录日志,所述记录日志包括以下一项或多项内容:所述索引目录中的数据、第一文件标识对应的存储位置、所述第一存储空间中的数据、删除重复文件所释放的存储容量、删除重复文件的数量、删除重复文件的文件类型。
  10. 根据权利要求1至7任一项所述的方法,其特征在于,所述获取写请求之前,所述方法还包括:
    获取指令,所述指令指示开启文件去重功能;
    响应于所述指令,执行获取写请求的操作。
  11. 一种文件查找方法,其特征在于,包括:
    获取第一文件,并确定所述第一文件的特征信息;
    根据所述第一文件的特征信息,确定索引目录中是否存在第三文件,所述第三文件的文件名与所述第一文件的特征信息相同,所述第三文件与第二文件的在第二存储空间的存储地址相关联。
  12. 根据权利要求11所述的方法,其特征在于,所述确定所述第一文件的特征信息,包括:
    根据所述第一文件的抽样数据,确定所述第一文件的特征信息;所述抽样数据是通过采样算法从所述第一文件的数据中获取的部分数据。
  13. 根据权利要求11或12所述的方法,其特征在于,所述方法还包括:
    在所述索引目录中不存在第三文件的情况下,将所述第一文件存储于所述第二存储空间,并在所述索引目录中增加第四文件,所述第四文件的文件名为所述第一文件的特征信息,所述第四文件与所述第一文件的存储地址相关联。
  14. 根据权利要求11或12所述的方法,其特征在于,所述方法还包括:
    在所述索引目录中存在第三文件的情况下,将所述第一文件的链接标识与所述第二文件相关联,所述第一文件的链接标识用于获取所述第一文件,从所述第一存储空间中删除所述第一文件。
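  15. A file deduplication apparatus, comprising:
    a file operation module, configured to obtain a write request, wherein the write request comprises a first file;
    a file cache module, configured to store the first file in response to the write request, wherein the first file is stored in a first storage space; and
    an information processing module, configured to determine whether a second file exists in a second storage space, wherein the second file is identical to the first file, and the second storage space and the first storage space are located at different layers of a storage system.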
  16. 根据权利要求15所述的装置,其特征在于,所述文件缓存模块还用于:
    在不存在所述第二文件的情况下,将所述第一文件存储于第三存储空间,并在所述第三存储空间内对所述第一文件执行缓存区操作;
    执行完所述缓存区操作后,将所述第一文件存储于所述第二存储空间。
  17. 根据权利要求15所述的装置,其特征在于,所述文件缓存模块还用于:
    在不存在所述第二文件的情况下,在所述第二存储空间内对所述第一文件执行缓存区操作;
    执行完所述缓存区操作后,将所述第一文件存储于所述第二存储空间。
  18. 根据权利要求15至17任一项所述的装置,其特征在于,所述信息处理模块还用于在存在第二文件的情况下,将所述第一文件的链接标识与所述第二文件相关联,所述第一文件的链接标识用于获取所述第一文件;
    所述文件缓存模块还用于从所述第一存储空间中删除所述第一文件。
  19. 根据权利要求15至17任一项所述的装置,其特征在于,所述信息处理模块还用于:
    根据所述第一文件的抽样数据,确定所述第一文件的特征信息;所述抽样数据是通过采样算法从所述第一文件的数据中获取的部分数据。
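  20. The apparatus according to claim 18 or 19, wherein the information processing module is further configured to:
    determine feature information of the first file; and
    determine, based on the feature information of the first file, whether a third file exists in an index directory, wherein a file name of the third file is the same as the feature information of the first file, and the third file is associated with a storage address of the second file in the second storage space.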
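  21. The apparatus according to any one of claims 15 to 20, wherein the apparatus further comprises a generating module, and the generating module is configured to generate prompt information, wherein the prompt information comprises one or more of the following: a notice that duplicate files have been deleted, the storage capacity released by deleting duplicate files, the number of duplicate files deleted, and the file types of the duplicate files.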
  22. 根据权利要求15至20任一项所述的装置,其特征在于,所述生成模块还用于生成记录日志,所述记录日志包括以下一项或多项内容:所述索引目录中的数据、第一文件标识对应的存储位置、所述第一存储空间中的数据、删除重复文件所释放的存储容量、删除重复文件的数量、删除重复文件的文件类型。
  23. 一种文件查找装置,其特征在于,包括:
    文件操作模块,用于获取第一文件;
    信息处理模块,用于确定所述第一文件的特征信息;
    所述信息处理模块还用于根据所述第一文件的特征信息,确定索引目录中是否存在第三文件,所述第三文件的文件名与所述第一文件的特征信息相同,所述第三文件与第二文件的在第二存储空间的存储地址相关联。
  24. 根据权利要求23所述的装置,其特征在于,所述信息处理模块用于确定所述第一文件的特征信息,包括:
    根据所述第一文件的抽样数据,确定所述第一文件的特征信息;所述抽样数据是通过采样算法从所述第一文件的数据中获取的部分数据。
  25. 根据权利要求23或24所述的装置,其特征在于,所述装置还包括文件缓存模块,所述文件缓存模块用于在所述索引目录中不存在第三文件的情况下,将所述第一文件存储于所述第二存储空间,并在所述索引目录中增加第四文件,所述第四文件的文件名为所述第一文件的特征信息,所述第四文件与所述第一文件的存储地址相关联。
  26. 根据权利要求23或24所述的装置,其特征在于,所述信息处理模块还用于在所述索引目录中存在第三文件的情况下,将所述第一文件的链接标识与所述第二文件相关联,所述第一文件的链接标识用于获取所述第一文件;
    所述文件缓存模块还用于从所述第一存储空间中删除所述第一文件。
  27. 一种设备,其特征在于,所述设备包括一个或多个处理器和存储器;所述存储器与所述一个或多个处理器耦合,所述存储器存储有计算机程序,所述一个或多个处理器执行所述计算机程序时,所述设备执行如权利要求1至14任一项所述的方法。
  28. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质存储有计算机程序,所述计算机程序被处理器执行以实现如权利要求1至14任一项所述的方法。
  29. 一种计算机程序产品,其特征在于,包括指令,当所述指令在计算机上运行时,使得计算机执行如权利要求1至14任一项所述的方法。
PCT/CN2021/127162 2021-10-28 2021-10-28 一种文件去重方法、装置和设备 WO2023070462A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2021/127162 WO2023070462A1 (zh) 2021-10-28 2021-10-28 一种文件去重方法、装置和设备
CN202180103614.0A CN118120212A (zh) 2021-10-28 2021-10-28 一种文件去重方法、装置和设备

Publications (1)

Publication Number Publication Date
WO2023070462A1 true WO2023070462A1 (zh) 2023-05-04

Family

ID=86160400

Country Status (2)

Country Link
CN (1) CN118120212A (zh)
WO (1) WO2023070462A1 (zh)

Also Published As

Publication number Publication date
CN118120212A (zh) 2024-05-31

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21961828

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE