Why is chunking up files with IPLD useful?

Expert

What are the advantages of chopping files into pieces?

Answers 1

Chunking a file is useful, but building too much knowledge of a file's structure into the chunker can cause problems. There are many approaches to chunking, and some chunking algorithms are already available in IPFS. There is also an open debate about whether chunking requires IPFS support or should be handled by IPLD instead. For that reason, new chunker projects may benefit from being written as an ADL (Advanced Data Layout) that presents itself as bytes, which makes them easier to reuse.

A file can be trivially "chunked" by cutting it into equally sized parts, for example one-megabyte pieces. Unfortunately, if the same file is then changed by inserting even a single byte at the beginning, every chunk boundary shifts, the chunks no longer match up, and deduplication is lost.

To keep deduplication in cases like this, rolling checksums based on algorithms such as Rabin-Karp are used. A fixed-size sliding window of bytes is hashed, and whenever a fixed number of bits of the resulting hash (for example, the first sixteen) are all zero, a chunk edge is signaled. Because an edge depends only on the bytes inside the window, rolling checksums create "sticky" chunk edges that stay mostly unchanged when small edits are made to a file, so deduplication remains good despite minor changes.

Going further into the specifics of particular formats, such as pinpointing where certain headers end or locating keyframes in videos, quickly reaches the point of diminishing or even outright negative returns. Each small gain adds code complexity, and the cost grows unacceptably (particularly when designing protocols that must be implemented in every language). Deduplication may even get worse system-wide, because the added complexity causes implementations to diverge. This is an important reminder to consider the broader ecosystem when creating new designs and solutions.
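
To make the difference between fixed-size and rolling-checksum chunking concrete, here is a minimal Python sketch. It is not the chunker that IPFS ships. The function names, the 48-byte window, the polynomial hash constants, and the choice to test the low-order bits of the hash are all assumptions made for illustration.

```python
import hashlib
import random


def fixed_chunks(data: bytes, size: int) -> list[bytes]:
    """Cut data into equally sized pieces (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def rolling_chunks(data: bytes, window: int = 48, zero_bits: int = 16) -> list[bytes]:
    """Content-defined chunking with a Rabin-Karp style polynomial rolling hash.

    A chunk edge is placed after every position where the hash of the last
    `window` bytes has its low `zero_bits` bits equal to zero. The window
    size, base and modulus are illustrative choices, not a standard.
    """
    base, mod = 257, (1 << 61) - 1
    mask = (1 << zero_bits) - 1
    top = pow(base, window - 1, mod)  # weight of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= window:
            # Slide the window: drop the oldest byte before adding the new one.
            h = (h - data[i - window] * top) % mod
        h = (h * base + byte) % mod
        if i + 1 >= window and (h & mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last edge
    return chunks


def shared_fraction(old: list[bytes], new: list[bytes]) -> float:
    """Fraction of distinct chunks in `new` that already exist in `old`."""
    old_ids = {hashlib.sha256(c).digest() for c in old}
    new_ids = {hashlib.sha256(c).digest() for c in new}
    return len(old_ids & new_ids) / len(new_ids)


# Pseudo-random sample data, then the same data with one byte added in front.
rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(100_000))
edited = b"\x00" + original

# zero_bits=10 gives chunks of about 1 KiB on average, so the small sample
# still produces many chunks; sixteen bits would average about 64 KiB.
print("fixed-size:", shared_fraction(fixed_chunks(original, 1024),
                                     fixed_chunks(edited, 1024)))
print("rolling:   ", shared_fraction(rolling_chunks(original, zero_bits=10),
                                     rolling_chunks(edited, zero_bits=10)))
```

With fixed-size chunks the one-byte insertion shifts every boundary, so no chunk of the edited data should be reusable. With the rolling hash only the chunk containing the edit changes, and nearly all of the rest should be shared. A practical chunker would also enforce minimum and maximum chunk sizes, which this sketch leaves out.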