Know More

This section answers commonly asked questions about the design decisions, technologies, and implementation strategies used in the File Deduplication Project.
Why use SHA-256 instead of SHA-1 or MD5?
SHA-256 provides stronger security and far lower collision risk than SHA-1 and MD5, both of which have known practical collision attacks.
Since our system relies on file hashes to identify duplicates, using a strong and secure hashing algorithm like SHA-256 helps prevent accidental collisions and ensures data integrity.
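As a minimal sketch (the function name is our own), hashing a file with Python's standard hashlib module looks like this; reading in chunks keeps memory usage flat even for large files:

```python
import hashlib

def sha256_of_file(path: str) -> str:
    """Return the hex SHA-256 digest of a file, hashing it in 8 KB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()
```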
How does the deduplication logic work?
Every file uploaded is hashed using SHA-256. This hash becomes the unique identifier (FileHash) for the file.
If the hash already exists in DynamoDB, it means the file is a duplicate, and the user is simply linked to the existing copy.
If the hash does not exist, the file is uploaded to S3 and the metadata is stored in DynamoDB.
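The core check can be sketched with boto3 along the following lines; the table name FileMetadata, the bucket name dedup-files, and the function names are illustrative assumptions, not the project's actual identifiers. Linking a user to an existing entry is shown in the next answer.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")
table = dynamodb.Table("FileMetadata")  # hypothetical table name
BUCKET = "dedup-files"                  # hypothetical bucket name

def is_duplicate(file_hash: str) -> bool:
    """True if a file with this SHA-256 hash is already stored."""
    return "Item" in table.get_item(Key={"FileHash": file_hash})

def store_new_file(file_bytes: bytes, file_hash: str, user_id: str, filename: str) -> None:
    """First upload of a file: store the object once, keyed by its hash."""
    s3.put_object(Bucket=BUCKET, Key=file_hash, Body=file_bytes)
    table.put_item(Item={
        "FileHash": file_hash,  # partition key
        "FileName": filename,
        "Users": [user_id],     # list of users referencing this file
    })
```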
How does the system track multiple users for the same file?
DynamoDB stores a list of user IDs under the Users attribute for each file hash. Every time a new user uploads a duplicate file, their ID is appended to this list. This allows one file in S3 to be referenced by multiple users independently.
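A hedged sketch of that append, using DynamoDB's list_append so the update happens server-side rather than as a read-modify-write (same hypothetical FileMetadata table as above):

```python
import boto3

table = boto3.resource("dynamodb").Table("FileMetadata")  # hypothetical name

def link_user(file_hash: str, user_id: str) -> None:
    """Append a user's ID to the Users list of an existing file entry."""
    table.update_item(
        Key={"FileHash": file_hash},
        UpdateExpression="SET #u = list_append(#u, :uid)",
        # "#u" aliases the attribute name, avoiding any reserved-word clash.
        ExpressionAttributeNames={"#u": "Users"},
        ExpressionAttributeValues={":uid": [user_id]},
    )
```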
What happens when a user deletes a file?
When a user deletes a file:
Their reference is removed from the Users list in DynamoDB.
If no users remain, the file is also deleted from S3 and the DynamoDB entry is removed.
If other users are still linked, the file remains and only the user's reference is cleared.
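That flow could look roughly like this with boto3 (same hypothetical table and bucket names as above; a production version would also guard this read-modify-write against concurrent updates):

```python
import boto3

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")
table = dynamodb.Table("FileMetadata")  # hypothetical table name
BUCKET = "dedup-files"                  # hypothetical bucket name

def delete_for_user(file_hash: str, user_id: str) -> None:
    """Drop one user's reference; remove the object when nobody is left."""
    item = table.get_item(Key={"FileHash": file_hash}).get("Item")
    if item is None:
        return
    remaining = [u for u in item["Users"] if u != user_id]
    if remaining:
        # Other users still reference the file: just rewrite the Users list.
        table.update_item(
            Key={"FileHash": file_hash},
            UpdateExpression="SET #u = :u",
            ExpressionAttributeNames={"#u": "Users"},
            ExpressionAttributeValues={":u": remaining},
        )
    else:
        # Last reference gone: delete the object and its metadata.
        s3.delete_object(Bucket=BUCKET, Key=file_hash)
        table.delete_item(Key={"FileHash": file_hash})
```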
Does this support file versioning?
Not directly. However, if a user uploads a modified version of the file (i.e., a new hash), it is treated as a new file and stored separately. Versioning can be implemented by enabling S3 versioning and associating versions with user metadata in DynamoDB.
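If versioning were added, enabling it on the bucket is a single call (bucket name is a hypothetical placeholder):

```python
import boto3

s3 = boto3.client("s3")
# Keep every object version instead of overwriting in place.
s3.put_bucket_versioning(
    Bucket="dedup-files",  # hypothetical bucket name
    VersioningConfiguration={"Status": "Enabled"},
)
```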
Why use DynamoDB instead of S3 metadata or a SQL database?
DynamoDB offers fast, scalable key-value storage with built-in support for map and list types — perfect for storing file hashes and associated user lists. Unlike traditional SQL databases, it requires no server maintenance and scales automatically with usage.
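For illustration, a single item in the hypothetical FileMetadata table might look like this, with the list type holding the user references directly alongside the file metadata:

```python
# Example item shape (all values hypothetical):
item = {
    "FileHash": "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",  # partition key
    "FileName": "lecture_notes.pdf",
    "SizeBytes": 482133,
    "Users": ["user-101", "user-202"],  # every user referencing this one S3 object
}
```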
Can this project be extended to support folders or shared links?
Yes. The architecture is flexible and can be extended to:
Track folders using user-specific paths
Create expirable share links using signed S3 URLs (sketched after this list)
Add tags or metadata per user for advanced file organization
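For example, an expirable share link is a single boto3 call; the bucket name, sample key, and expiry below are assumptions for illustration:

```python
import boto3

s3 = boto3.client("s3")
file_hash = "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08"  # sample key

# Anyone holding this URL can download the object until the link expires.
share_url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "dedup-files", "Key": file_hash},  # hypothetical bucket
    ExpiresIn=3600,  # link is valid for one hour
)
print(share_url)
```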
Why ownCloud over Nextcloud?
Both ownCloud and Nextcloud are capable, open-source file hosting platforms. However, for this project:
ownCloud has a more stable API and lighter base install — making it easier to connect with external systems like AWS.
The official enterprise version of ownCloud also aligns better with commercial deduplication use cases.
Our EC2 setup and cookbook were optimized for ownCloud, making it a smoother integration path.
That said, the architecture is modular — it can work with Nextcloud too by tweaking API or WebDAV endpoints.
Can this work with large files?
Yes — but with considerations:
The architecture supports large files as long as they fit within S3 and Lambda limits.
Lambda has a 6 MB payload limit for direct API calls. For larger files, use:
Pre-signed S3 URLs for direct upload (see the sketch after this answer)
A temporary staging bucket with event triggers
Also use multipart uploads and streaming (chunked) hash calculation when integrating with systems that handle very large files
So yes, large files are supported with minor architectural adjustments.
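As one concrete sketch, a pre-signed upload URL lets the client PUT the file straight to S3, so the bytes never pass through Lambda's 6 MB request limit (bucket and key below are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# The client uploads directly to S3 with this URL; Lambda only issues it.
upload_url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "dedup-staging", "Key": "incoming/report.pdf"},  # hypothetical
    ExpiresIn=900,  # upload window of 15 minutes
)
```

An S3 event trigger on the staging bucket can then invoke the hashing and deduplication logic once the upload completes.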
What are real-world use cases of this project?
This project solves a fundamental storage problem and can be adapted across industries:
Educational Platforms: Prevent repeated uploads of the same study material by students.
Team Collaboration Tools: One person uploads a file and others get instant access, with no duplication and no clutter.
Enterprise File Backup Systems: Back up user devices or folders while storing each unique file only once.
Digital Archives or Media Libraries: Avoid storing the same video or image across different projects or clients.