DATA FINDING, SHARING AND DUPLICATION REMOVAL IN THE CLOUD
Janhavi Rahul Patil, Tanishka Manoj Suryavanshi, Unnati Karbhari Bhamare, Arya Narendra Joshi, Prof. P. A. Agrawal
1 janhavipatil821@gmail.com, Student of Poly. Dept. of Computer Technology K.K.Wagh, Nashik, India
2 tanusuryvanshi14@gmail.com, Student of Poly. Dept. of Computer Technology K.K.Wagh, Nashik, India
3 unnatibhamare590@gmail.com, Student of Poly. Dept. of Computer Technology K.K.Wagh, Nashik, India
4 aaryajoshi1807@gmail.com, Student of Poly. Dept. of Computer Technology K.K.Wagh, India
5 Internal Guide, Dept. of Computer Technology K.K.Wagh, Nashik, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - Deduplication eliminates duplicate or redundant data to reduce the volume of stored data, and is commonly used in data backup, network optimization, and storage management. Instead of keeping multiple copies of the same content, it retains only one physical copy and points every other instance to that copy. The granularity of deduplication can range from an entire file to a single data block. Traditional deduplication methods, however, have limitations with respect to encrypted data and security. The primary objective of this project is to develop a distributed deduplication system that offers increased reliability. In this system, data chunks are distributed across the Hadoop Distributed File System (HDFS), and a robust key management system is used to ensure secure deduplication across the slave nodes. The MD5 hashing algorithm and the 3DES encryption algorithm are used to support the deduplication process. The proposed approach relies on Proof of Ownership (PoW) of the file. With this method, deduplication can effectively address the issues of reliability and label consistency in HDFS storage systems. The proposed system reduced the cost and time of uploading and downloading data while also optimizing storage space.
Key Words: Cloud computing, data storage, file checksum algorithms, computational infrastructure, duplication.
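The idea of retaining one physical copy per unique data block and referring other instances to it can be illustrated with a minimal sketch. This is not the system described in the paper: it is an in-memory toy that uses MD5 only as a block fingerprint, and it omits HDFS distribution, 3DES encryption, key management, and proof of ownership. The class name DedupStore, its methods, and the 4-byte chunk size are hypothetical names and values chosen for illustration.

```python
import hashlib


class DedupStore:
    """Toy in-memory store that keeps one physical copy per unique chunk."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size  # hypothetical block size, in bytes
        self.chunks = {}              # MD5 digest -> chunk bytes (stored once)
        self.files = {}               # file name -> ordered list of digests

    def put(self, name, data):
        """Store data under name and return the number of new chunks written."""
        digests = []
        written = 0
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            digest = hashlib.md5(chunk).hexdigest()
            if digest not in self.chunks:
                # First time this content is seen: keep the physical copy.
                self.chunks[digest] = chunk
                written += 1
            # Duplicate or not, the file only records a reference.
            digests.append(digest)
        self.files[name] = digests
        return written

    def get(self, name):
        """Rebuild a file by following its chunk references in order."""
        return b"".join(self.chunks[d] for d in self.files[name])
```

Setting chunk_size to at least the file length gives file-level deduplication, while smaller values give block-level deduplication, which mirrors the range of granularity mentioned in the abstract.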