How to Sync Large Videos over the Internet using Chunks

Justin
Aug 7, 2020

Recently I used the extra time during quarantine to launch a project, sparkvid.com. SparkVid is a tool that transcribes the audio in video and audio files to text.


One particularly interesting challenge was sending large files over a network.

Background

Interactions over the internet happen through the exchange of information between a client and a server. That information is transmitted in the form of packets.

These packets hold a fixed amount of data, usually a couple of kilobytes, and are transferred over the physical wire to the destination. Once the destination server interprets the data in the packet, the transaction is complete.

In most use cases, transfers are small.

The data to load a Facebook news feed is a couple hundred KB. Instagram photos stay below 1 MB. This article will probably be under 3 MB. Because of this, the transfer is fast and feels almost instantaneous.

The Problem

When the payload gets larger, things get trickier.

Remember when you downloaded a big file only to have it stop unexpectedly around 90–95%?

This is similar to the challenge of uploading files in SparkVid.

Since the maximum upload size of a video on SparkVid is theoretically unlimited, this poses unique challenges around memory, network latency, and connection stability on the client. 4K videos can get BIG.

To ensure the best customer experience, there needs to be a way to upload a video file and a way to resume an upload after an unexpected drop.

The Solution

The best way to solve the problem is to reimagine it. Instead of viewing a video as a single sequential file, we can model it as a long stream of data.

We pick a block size (32 MB) and divide the upload into a number of fixed-size blocks; a 1 GB video, for example, becomes 32 blocks of 32 MB each. These blocks let us upload different sections of the video independently of each other.

Client Side Video Chunking
Client Upload Script in JS
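Here's a minimal sketch of the idea in JavaScript, assuming a hypothetical /upload_chunk endpoint and a client-generated upload ID (the real SparkVid endpoint and field names may differ):

```js
const CHUNK_SIZE = 32 * 1024 * 1024; // 32 MB blocks

// Read the selected file and upload it chunk by chunk.
async function uploadFile(file, uploadId) {
  const totalChunks = Math.ceil(file.size / CHUNK_SIZE);

  for (let index = 0; index < totalChunks; index++) {
    const start = index * CHUNK_SIZE;
    const end = Math.min(start + CHUNK_SIZE, file.size);
    const chunk = file.slice(start, end); // Blob.slice avoids loading the whole file into memory

    const form = new FormData();
    form.append("upload_id", uploadId);
    form.append("chunk_index", index);
    form.append("total_chunks", totalChunks);
    form.append("chunk", chunk, file.name);

    const res = await fetch("/upload_chunk", { method: "POST", body: form });
    if (!res.ok) {
      throw new Error(`Chunk ${index} failed with status ${res.status}`);
    }
  }
}

document.querySelector("#video-input").addEventListener("change", (event) => {
  const file = event.target.files[0];
  const uploadId = crypto.randomUUID(); // or an id handed out by the server
  uploadFile(file, uploadId).catch(console.error);
});
```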

This code reads the selected file and splits it into 32 MB chunks. It then uploads these chunks one at a time until every chunk has been sent.

This will make sure our data is sent properly to the server.

Server Side

On the server, we then have to recombine the chunks. There are a few things to consider before writing the code.

Chunks can arrive out of order when the client sends multiple chunks in parallel to maximize bandwidth, so we need a consistent scheme for identifying and ordering them. There also needs to be a way to associate chunk data across the parts of an upload.

Server Side Reconfiguration

We can solve this by using a Redis cache to persist data between requests and order the chunks by chunk ID.

Backend in GO
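A rough sketch of this backend in Go, using the go-redis client and a shared local directory standing in for HDFS (the endpoint, key names, and paths are assumptions for illustration):

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"os"
	"path/filepath"
	"strconv"

	"github.com/redis/go-redis/v9"
)

var (
	ctx = context.Background()
	rdb = redis.NewClient(&redis.Options{Addr: "localhost:6379"})
)

// uploadChunk stores one chunk on shared disk and records its index in Redis.
func uploadChunk(w http.ResponseWriter, r *http.Request) {
	uploadID := r.FormValue("upload_id")
	index, _ := strconv.Atoi(r.FormValue("chunk_index"))
	total, _ := strconv.Atoi(r.FormValue("total_chunks"))

	chunk, _, err := r.FormFile("chunk")
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	defer chunk.Close()

	// Write the chunk to a shared location (a local dir here; HDFS in production).
	path := filepath.Join("/tmp/chunks", fmt.Sprintf("%s.part%d", uploadID, index))
	out, err := os.Create(path)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	io.Copy(out, chunk)
	out.Close()

	// Remember which chunk indexes have arrived for this upload.
	rdb.SAdd(ctx, "chunks:"+uploadID, index)

	// Once every chunk is present, stitch the file back together in chunk-id order.
	if n, _ := rdb.SCard(ctx, "chunks:"+uploadID).Result(); int(n) == total {
		reassemble(uploadID, total)
	}
	w.WriteHeader(http.StatusOK)
}

// reassemble concatenates the chunk files, ordered by chunk id, into the final video.
// (Cleanup of chunk files and Redis keys is omitted for brevity.)
func reassemble(uploadID string, total int) error {
	final, err := os.Create(filepath.Join("/tmp/videos", uploadID+".mp4"))
	if err != nil {
		return err
	}
	defer final.Close()

	for i := 0; i < total; i++ {
		part, err := os.Open(filepath.Join("/tmp/chunks", fmt.Sprintf("%s.part%d", uploadID, i)))
		if err != nil {
			return err
		}
		io.Copy(final, part)
		part.Close()
	}
	return nil
}

func main() {
	os.MkdirAll("/tmp/chunks", 0o755)
	os.MkdirAll("/tmp/videos", 0o755)
	http.HandleFunc("/upload_chunk", uploadChunk)
	http.ListenAndServe(":8080", nil)
}
```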

Advantages

This implementation is advantageous in several ways:

  • We can fully utilize the client’s upload bandwidth. Since we can open multiple HTTP requests in parallel, time one connection spends idle in a TCP handshake can be filled by transfers on another.
  • Users will be able to resume uploads, since the server records which blocks are still missing (see the sketch after this list).
  • We will be able to utilize memory more efficiently on the server.
  • Lastly, this allows for theoretically unbounded file sizes.
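To illustrate the resumability point, here's a hedged sketch of a helper that reports which blocks are still missing, reusing the rdb client, ctx, and the chunks:<upload_id> set from the backend sketch above:

```go
// missingChunks returns the chunk indexes that have not arrived yet for an upload,
// so the client can resume by re-sending only those blocks.
func missingChunks(uploadID string, total int) ([]int, error) {
	received, err := rdb.SMembers(ctx, "chunks:"+uploadID).Result()
	if err != nil {
		return nil, err
	}
	have := make(map[int]bool, len(received))
	for _, s := range received {
		i, _ := strconv.Atoi(s)
		have[i] = true
	}
	var missing []int
	for i := 0; i < total; i++ {
		if !have[i] {
			missing = append(missing, i)
		}
	}
	return missing, nil
}
```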

Considerations

Of course, there are some trade-offs to take into account. Chunks have to be large enough to be worth the cost of a request, but not so small that we hit diminishing returns from continuously creating new HTTP requests. We need to find a good middle ground for chunk size.

  • We ended up going with 32 MB as a reasonable balance.

Every time we create a new AJAX request, we want to avoid paying for a fresh TCP handshake.

  • We approached this by using HTTP keep-alive connections (see the sketch below)
https://www.imperva.com/learn/performance/http-keep-alive/
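On the server side of those requests, Go's standard net/http server keeps connections alive by default; the main knob is how long idle connections stay open between chunk uploads. A small sketch, assuming the Go backend from earlier (the timeout value is an arbitrary choice):

```go
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	srv := &http.Server{
		Addr:        ":8080",
		IdleTimeout: 120 * time.Second, // keep idle keep-alive connections open between chunk requests
	}
	srv.SetKeepAlivesEnabled(true) // on by default; shown explicitly here
	log.Fatal(srv.ListenAndServe())
}
```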

We need a way to temporarily store the chunk information so it can be referenced once all the chunks have arrived.

  • We can use a shared Redis queue to hold the locations of the chunks
  • We can use a shared file system (e.g. HDFS) to share the data in the chunks

Metadata File

The metadata is important because it lets us relate the file chunks to each other while we wait for the remaining chunks to arrive. Since video uploads are all-or-nothing and write-once-read-many, we don’t need to store this data persistently.

We can use Redis to cache the chunk information for fast reference between chunk uploads.
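One possible layout, again sketched with go-redis and reusing rdb and ctx from the earlier backend sketch (the key names and 24-hour TTL are assumptions): a hash per upload holds the metadata, and both it and the set of received chunk indexes expire so abandoned uploads clean themselves up.

```go
// recordUploadMetadata caches per-upload metadata in Redis so any server handling
// a later chunk can relate it to the same upload. Nothing here needs to outlive
// the upload itself, so both keys get a TTL instead of persistent storage.
func recordUploadMetadata(uploadID, filename string, totalChunks int) error {
	key := "upload:" + uploadID
	if err := rdb.HSet(ctx, key,
		"filename", filename,
		"total_chunks", totalChunks,
	).Err(); err != nil {
		return err
	}
	rdb.Expire(ctx, key, 24*time.Hour)
	rdb.Expire(ctx, "chunks:"+uploadID, 24*time.Hour)
	return nil
}
```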

Putting it Together

Once the file is on the server, the rest is simple. Since the upload, recombination, and processing are three separate steps, we can divide the system into three separate microservices.

High Level Architecture
  • We can use Redis as a message broker (sketched below)
High Level Architecture with Brokers
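A sketch of that hand-off using Redis as a simple queue, reusing rdb and ctx from earlier; the queue name, message shape, and the processVideo function are placeholders for illustration. The upload service pushes a job once a file has been recombined, and the processing service blocks on the queue.

```go
// Producer side: enqueue a transcription job once the video is fully recombined.
func enqueueJob(uploadID string) error {
	return rdb.LPush(ctx, "transcode:jobs", uploadID).Err()
}

// Consumer side: the processing microservice pops jobs as they arrive.
func workerLoop() {
	for {
		// BRPOP blocks until a job is available (0 = wait indefinitely).
		res, err := rdb.BRPop(ctx, 0, "transcode:jobs").Result()
		if err != nil {
			continue
		}
		uploadID := res[1] // res[0] is the queue name
		processVideo(uploadID) // placeholder for the actual transcription step
	}
}
```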

Conclusion

This was a great challenge, and it helped me understand the difficulties of building a large distributed system. Check out SparkVid.com for audio and video transcription. Your first 90 minutes are free.
