Data Warehousing Stock Market Data

Justin
5 min read · Mar 28, 2021


This post is about a service I built that packages free daily stock market data as a convenient CSV. Why? This service helps our customers find insights, back-test strategies, and explore the stock market in a different format than what is usually presented on Google Finance, Yahoo Finance, and other sites. It differs from those sites by offering a fine-grained look at the stock market and the transactions it is made of.

Building this scraper is complicated: there are over 6,000 different tickers to search for and 6.5 hours of data to scrape (the market is open from 9:30 to 4:00). If we scrape once every minute, this comes out to 6,000 * 390 = 2,340,000 rows per day. We're interested in the open, close, high, low, and volume for each bar, as well as a timestamp and the symbol associated with that bar. We can represent the open, close, high, and low as floats and the volume as an integer. A float is 4 B and an int is 4 B, so that critical data comes to about 20 B together. We can use an integer to represent the timestamp as an epoch, and a varchar of 8 characters to represent the ticker, which brings the total to roughly 25 B per row. 25 * 2,340,000 is 58,500,000 B, or about 58.5 MB per day. That's a lot of data!
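
As a sanity check, here is the same arithmetic as a small Python sketch (the constants mirror the estimates above; it is an illustration, not part of the scraper itself):

```python
# Back-of-the-envelope storage estimate for one trading day.
# Figures mirror the text: ~6,000 tickers, 390 one-minute bars per session,
# four 4 B floats (open/high/low/close), a 4 B int volume, a 4 B epoch
# timestamp, and a short ticker string -- roughly 25 B per row.
TICKERS = 6_000
MINUTES_PER_SESSION = int(6.5 * 60)   # 9:30 to 4:00 -> 390 bars
BYTES_PER_ROW = 25                    # rough per-row estimate from above

rows_per_day = TICKERS * MINUTES_PER_SESSION   # 2,340,000 rows
bytes_per_day = rows_per_day * BYTES_PER_ROW   # 58,500,000 B

print(f"{rows_per_day:,} rows/day, ~{bytes_per_day / 1e6:.1f} MB/day")
```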

We need to consider several principles of data warehousing, namely value, cost, adaptability, and self-service. With these principles in mind, the data warehousing method requires careful consideration. Since the columns are small, we won't need any big data storage for this data stream. One option is storing it in a SQL database. This has obvious merits: our data becomes searchable, and we can run complex queries to get insights without having to write a script that loads all the data into memory. An issue, however, is insert time. If we were to use a single SQL instance, we could be throttled. And because all the data would live in one unified database, exports would be costly and time consuming, since we would have to scan through GBs of data just to pull out subsets. Since we are launching this product on a budget, the cost of scaling SQL would grow very quickly, putting it outside the budget of the project and out of line with the data warehousing principles.

Our solution was to create a data warehousing pipeline built on S3. Using a web service bucket like S3, we were able to create a data storage format that is scalable and low cost and that makes it easy to provide self-service tools. This lets us warehouse large amounts of data cheaply, which makes it possible to keep the operation running for a long time even while it is unprofitable. Using a centralized server running a multithreaded scraper, we are able to decouple the scraping components from storage, making a resilient and cost-efficient data pipeline.
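
As a rough illustration of the upload step, a minimal sketch with boto3 might look like the following; the bucket name, key layout, and file path are assumptions for the example, not the production configuration:

```python
import datetime

import boto3  # AWS SDK for Python

s3 = boto3.client("s3")

BUCKET = "daily-stock-market-data"  # hypothetical bucket name

def upload_daily_csv(local_path: str, trading_date: datetime.date) -> str:
    """Upload one day's scraped bars CSV to S3, keyed by date."""
    key = f"bars/{trading_date:%Y/%m/%d}.csv"  # assumed key layout
    s3.upload_file(local_path, BUCKET, key)
    return key

# e.g. upload_daily_csv("/tmp/2021-03-26.csv", datetime.date(2021, 3, 26))
```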

Our approach to data warehousing was a multistep approach with a couple of phases of iteration. The idea was to find the smallest unit into which we could subdivide the data so that we could retrieve it optimally. Since a typical file will be around 60 MB, we knew that storage would grow rapidly: over 365 days, this would be 60 * 365 = 21.9 GB, and over 5 years we would have 109.5 GB stored. This solution requires us to be able to store and reference the data quickly, so we implemented a data warehouse index in a persistent database, like SQL, for exactly this purpose. With it, we can query for a certain date and return the data associated with it.
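
A minimal sketch of such an index is below, using SQLite for illustration; the table name, columns, and key layout are assumptions, not the production schema:

```python
import sqlite3

conn = sqlite3.connect("warehouse_index.db")

# One row per trading day: the date and the S3 key holding that day's CSV.
conn.execute(
    """CREATE TABLE IF NOT EXISTS warehouse_index (
           trading_date TEXT PRIMARY KEY,   -- e.g. '2021-03-26'
           s3_key       TEXT NOT NULL       -- e.g. 'bars/2021/03/26.csv'
       )"""
)

def register_day(trading_date: str, s3_key: str) -> None:
    """Record where a day's data lives after it is uploaded."""
    conn.execute(
        "INSERT OR REPLACE INTO warehouse_index VALUES (?, ?)",
        (trading_date, s3_key),
    )
    conn.commit()

def lookup_day(trading_date: str):
    """Return the S3 key for a given date, or None if we never scraped it."""
    row = conn.execute(
        "SELECT s3_key FROM warehouse_index WHERE trading_date = ?",
        (trading_date,),
    ).fetchone()
    return row[0] if row else None
```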

However, querying the data warehouse every time we need data is expensive and extremely inefficient. It would go against the data warehousing principles of cost and value. To remedy this we implemented a cache. We realized that recent data will be referenced most often, and since our value proposition is providing daily stock market data, we store the most recent data locally in a cache so it can be served quickly.

Since we want to provide the most relevant and recent data quickly, the cache is invalidated once an hour to ensure that we are always serving the most up-to-date data to our customers.
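
A minimal sketch of that cache, assuming a simple in-memory store with an hourly expiry (the fetch function is a placeholder for the S3 download described above):

```python
import time

CACHE_TTL_SECONDS = 60 * 60   # entries are considered stale after one hour
_cache = {}                   # trading_date -> (fetched_at, csv_bytes)

def get_day(trading_date: str, fetch_from_s3) -> bytes:
    """Serve a day's CSV from the local cache, refreshing it hourly."""
    now = time.time()
    entry = _cache.get(trading_date)
    if entry and now - entry[0] < CACHE_TTL_SECONDS:
        return entry[1]                     # still fresh: serve from cache
    data = fetch_from_s3(trading_date)      # stale or missing: hit the warehouse
    _cache[trading_date] = (now, data)
    return data
```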

Our most valuable asset is our data, and because of that, security is a top concern. Any leak of our data would be catastrophic to our business and to our customers, who have entrusted us with their data. It's imperative that we keep all of our data inside a VPC hosted on the Amazon network. This ensures that only our trusted sources, namely our storefront, have access to our data.
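
One common way to enforce this on S3 is a bucket policy that denies any request not arriving through the VPC endpoint. The sketch below shows the idea; the bucket name and endpoint ID are placeholders, and a production policy would also carve out administrative access so you don't lock yourself out:

```python
import json

import boto3

BUCKET = "daily-stock-market-data"           # hypothetical bucket name
VPC_ENDPOINT_ID = "vpce-0123456789abcdef0"   # hypothetical endpoint ID

# Deny every request to the bucket that does not come through our VPC endpoint.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyAccessOutsideVPC",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"],
        "Condition": {"StringNotEquals": {"aws:sourceVpce": VPC_ENDPOINT_ID}},
    }],
}

boto3.client("s3").put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```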

With this public/private separation in place, we are able to protect our assets while delivering immense value for a low cost. Subscribe to our Substack at https://dsmd.substack.com, and check out https://dailystockmarketdata.com for more info. Thank you for reading, and happy trading!
