Know More

This section answers commonly asked questions about the design decisions, technologies, and implementation strategies used in the File Deduplication Project.

Why use SHA-256 instead of SHA-1 or MD5?
  • SHA-256 provides a higher level of security and lower collision probability compared to SHA-1 and MD5.

  • Since our system relies on file hashes to identify duplicates, using a strong and secure hashing algorithm like SHA-256 helps prevent accidental collisions and ensures data integrity.
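As an illustration, a file's SHA-256 digest can be computed with Python's standard `hashlib` in fixed-size chunks, so even large files never need to be held fully in memory (the function name and chunk size here are illustrative, not taken from the project's code):

```python
import hashlib

def file_sha256(path: str, chunk_size: int = 8192) -> str:
    """Compute a file's SHA-256 hex digest by streaming it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read until f.read() returns b"" (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The resulting hex string is what the system would store as the file's unique identifier.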

How does the deduplication logic work?

Every file uploaded is hashed using SHA-256. This hash becomes the unique identifier (FileHash) for the file.

  • If the hash already exists in DynamoDB, it means the file is a duplicate, and the user is simply linked to the existing copy.

  • If the hash does not exist, the file is uploaded to S3 and the metadata is stored in DynamoDB.
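The flow above can be sketched in memory. The two dicts stand in for the real DynamoDB table and S3 bucket; names and attribute layout are illustrative, not taken from the project's code:

```python
import hashlib

dedup_table = {}   # stands in for the DynamoDB table, keyed by FileHash
object_store = {}  # stands in for the S3 bucket

def upload(user_id: str, filename: str, data: bytes) -> str:
    """Deduplicating upload: store the bytes once, link users by hash."""
    file_hash = hashlib.sha256(data).hexdigest()
    entry = dedup_table.get(file_hash)
    if entry is not None:
        # Duplicate: just link this user to the existing copy.
        if user_id not in entry["Users"]:
            entry["Users"].append(user_id)
    else:
        # New file: store the object once, then record its metadata.
        object_store[file_hash] = data
        dedup_table[file_hash] = {"FileName": filename, "Users": [user_id]}
    return file_hash
```

Two users uploading identical bytes end up sharing a single stored object.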

How does the system track multiple users for the same file?

DynamoDB stores a list of user IDs under the Users attribute for each file hash. Every time a new user uploads a duplicate file, their ID is appended to this list. This allows one file in S3 to be referenced by multiple users independently.
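In DynamoDB terms, appending a user can be done with an `UpdateItem` call using `list_append`. The sketch below only builds the request parameters (the table name `FileDedup` and helper name are assumptions; `Users` collides with a DynamoDB reserved word, so an expression attribute name is used):

```python
def add_user_params(file_hash: str, user_id: str) -> dict:
    """Build UpdateItem parameters that append a user to the Users list.

    Table name 'FileDedup' is an assumed placeholder. The real call would
    be e.g. boto3.client("dynamodb").update_item(**add_user_params(...)).
    """
    return {
        "TableName": "FileDedup",
        "Key": {"FileHash": {"S": file_hash}},
        # USERS is a DynamoDB reserved word, hence the #u alias.
        "UpdateExpression": "SET #u = list_append(if_not_exists(#u, :empty), :uid)",
        "ExpressionAttributeNames": {"#u": "Users"},
        "ExpressionAttributeValues": {
            ":uid": {"L": [{"S": user_id}]},
            ":empty": {"L": []},
        },
    }
```

`if_not_exists` initializes the list on the first write, so the same expression works for new and existing entries.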

What happens when a user deletes a file?

When a user deletes a file:

  • Their reference is removed from the Users list in DynamoDB.

  • If no users remain, the file is also deleted from S3 and the DynamoDB entry is removed.

  • If other users are still linked, the file remains and only the user's reference is cleared.
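The deletion rules above amount to reference counting. A minimal pure-Python sketch of the DynamoDB/S3 logic (the dicts stand in for the real table and bucket):

```python
def remove_user(user_id: str, file_hash: str,
                dedup_table: dict, object_store: dict) -> None:
    """Drop a user's reference; garbage-collect the object when the
    last reference disappears."""
    entry = dedup_table.get(file_hash)
    if entry is None or user_id not in entry["Users"]:
        return  # nothing to delete for this user
    entry["Users"].remove(user_id)
    if not entry["Users"]:
        # No users remain: remove both the object and its metadata.
        object_store.pop(file_hash, None)
        del dedup_table[file_hash]
```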

Does this support file versioning?

Not directly. However, if a user uploads a modified version of the file (i.e., a new hash), it is treated as a new file and stored separately. Versioning can be implemented by enabling S3 versioning and associating versions with user metadata in DynamoDB.

Why use DynamoDB instead of S3 metadata or a SQL database?

DynamoDB offers fast, scalable key-value storage with built-in support for map and list types — perfect for storing file hashes and associated user lists. Unlike traditional SQL databases, it requires no server maintenance and scales automatically with usage.

Why ownCloud over Nextcloud?

Both ownCloud and Nextcloud are capable, open-source file hosting platforms. However, for this project:

  • ownCloud has a more stable API and lighter base install — making it easier to connect with external systems like AWS.

  • The official enterprise version of ownCloud also aligns better with commercial deduplication use cases.

  • The project's EC2 setup and cookbook were optimized for ownCloud, making it a smoother integration path.

That said, the architecture is modular — it can work with Nextcloud too by tweaking API or WebDAV endpoints.

Can this work with large files?

Yes — but with considerations:

  • The architecture supports large files as long as they fit within S3 and Lambda limits.

  • Lambda has a 6 MB payload limit for direct API calls. For larger files, use:

    • Pre-signed S3 URLs for direct upload

    • A temporary staging bucket with event triggers

  • Also use multipart upload and streaming (chunked) hash computation when integrating very large files.

So yes, large files are supported with minor architectural adjustments.
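The "multipart upload plus streaming hash" idea can be sketched in one pass: the file is read in parts, each part feeds the SHA-256 digest and is handed to an upload callback. Here `upload_part` is a hypothetical placeholder for the real S3 call (e.g. boto3's `upload_part` within a multipart upload):

```python
import hashlib
from typing import BinaryIO, Callable, Iterator

# S3 multipart parts must be at least 5 MB (except the final part).
PART_SIZE = 5 * 1024 * 1024

def iter_parts(stream: BinaryIO, part_size: int = PART_SIZE) -> Iterator[bytes]:
    """Yield fixed-size parts from a stream for multipart upload."""
    while True:
        part = stream.read(part_size)
        if not part:
            break
        yield part

def upload_large(stream: BinaryIO,
                 upload_part: Callable[[int, bytes], None]) -> str:
    """Stream parts to storage while computing SHA-256 in the same pass.

    `upload_part(part_number, data)` is a placeholder for the real S3
    multipart call; this sketch only models the single-pass flow.
    """
    digest = hashlib.sha256()
    for i, part in enumerate(iter_parts(stream), start=1):
        digest.update(part)
        upload_part(i, part)
    return digest.hexdigest()
```

Computing the hash while uploading avoids a second read of a multi-gigabyte file just to obtain its FileHash.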

What are real-world use cases of this project?

This project solves a fundamental storage problem and can be adapted across industries:

  • Educational Platforms: prevent repeated uploads of the same study material by students

  • Team Collaboration Tools: one person uploads a file, others get instant access — no duplication, no clutter

  • Enterprise File Backup Systems: back up user devices or folders while only storing unique files once

  • Digital Archives or Media Libraries: avoid storing the same video/image across different projects or clients
