A Financial Database for Beginners
“Bloomberg who?” in 6 steps.
By Tristan Rachman, Director of Technology (McGill)
Any person who has tried to find good quality financial data has immediately run into a brick paywall. From paying thousands for a Bloomberg terminal to subscription services offered by other companies, it may initially seem like a dead-end for us Fintech enthusiasts who don’t have access to company funds. More often than not, these individuals happen to be beginner programmers looking for a project to hone their skills and build off of, and that is the particular audience this article aims to serve.
Beginner projects are weird. They’re so essential yet so artificial and often insignificant. Early in a developer’s learning career, the right project choice can mean the difference between giving up and excelling. A significant reason for this is simply because of the nature of these projects: they’re unexciting (who wants to build a calculator?). Beginner projects should inspire you and be something that you can continuously build off of as you progress. This is what the following project guide intends to provide. If you are interested in the financial markets, you can use this to make something impressive and build a key component of your portfolio. Chip away at this thing 30 minutes a day for a year, and I promise you that you will be an exponentially better programmer and have a powerful market tool to boot.
STEP 1: Project Set Up
First and foremost, let’s create a main directory to house our project. You can call it whatever you would like, but for the purposes of this article I named it ‘FinanceApp’. Within this directory create two sub-directories and call them ‘tickerlist’ and ‘dailydata’. Now you’re ready for the next step, choosing the data source.
STEP 2: Data Source (Yahoo Finance API)
The best free alternatives I have found for high quality financial data in 2021 are the Yahoo Finance API and stooq.com. Alpha Vantage is an honourable mention as their data is technically free however you are limited to a certain number of API calls per day and in the long run, when building a comprehensive financial database, this can become quite a nuisance. For learning purposes we will be using the Yahoo Finance API through the ‘yfinance’ Python library. The Yahoo Finance public API has been deprecated since 2012, however, their information has been made available through sources like rapidAPI. In turn, yfinance is a python library which enables you to connect to the Yahoo Finance API and receive pre-formatted data back (in Panda’s DataFrames). This is a huge quality of life improvement and a major contributing factor as to why I personally cannot recommend this library enough for a personal project of this scale. Beyond the scope of this project however, I strongly recommend checking out the other features of this library and what it can do. In particular, how you can use this library to extend the capability of the basic system we make today.
STEP 3: Ticker List (Nasdaq Trader)
The next step in any financial data collection journey is finding a list of all the tickers or symbols we need to keep track of. Luckily, there are a few options to choose from here. We will be using nasdaqtrader.com. The symbol lists on this site get updated frequently and keep track of all the companies listed on the Nasdaq, NYSE, and more. Subsequently, this script will be tracking those exchanges, however you are welcome to get ticker lists for any exchange you please, so long as Yahoo Finance tracks the companies.
3a: Download Nasdaq Ticker List
Once you head to the above link, in the bottom left hand corner of the page you will see a drop down menu indicating ‘Downloadable Files’. Select ‘Nasdaq — Listed’ and click ‘Go’. It will take you to a page with raw text output and some information that should look something like this.
As you can see it gives us key information on the companies listed on the Nasdaq. Select and copy all of this text into a text file within your ‘tickerlist’ directory. Name the file ‘nasdaqlisted.txt’.
3b: Download NYSE Ticker List
Repeat the same steps above, however, instead of choosing ‘Nasdaq — Listed’, select ‘Other Exchanges’. Save this list as ‘nyse.txt’.
STEP 4: Saving the Tickers as JSON Data
Let’s write a python script called ‘tickers.py’. Inside this file put the code you see below.
def merge_tickers():
nasdaqtxt = ‘tickerlist/nasdaq.txt’
nysetxt = ‘tickerlist/nyse.txt’
tickers = {}
with open(nasdaqtxt,’r’) as file:
companies = file.readlines()
for company in companies:
company_data = company.split(‘|’)
tickers[company_data[0]] = {
‘ticker’ : company_data[0],
‘company’ : company_data[1],
‘exchange’ : ‘Nasdaq’,
‘ETF’ : False } with open(nysetxt,’r’) as file:
def exchange_type(type):
if type in (‘A’,’N’,’P’): return ‘NYSE’
else: return ‘OTHER’
def is_etf(flag):
if flag == ‘Y’: return True
else: return False
companies = file.readlines()
for company in companies:
company_data = company.split(‘|’)
if '.' not in company_data[0] and '$' not in company_data[0]:
tickers[company_data[0]] = {
‘ticker’ : company_data[0],
‘company’ : company_data[1],
‘exchange’ : exchange_type(company_data[2]),
‘ETF’ : is_etf(company_data[4]) } return tickers
What the above code does is simple: merge_tickers() parses each .txt file we created in the previous step.
For the nasdaq.txt file, it stores all the lines of the text file in an array, iterates through each line, and splits each line on ‘|’ as they are in the format “Symbol | Security Name | etc.”. This returns an array with each of those objects. We add this to the python dictionary we created called tickers, which saves particular data we want to track. Since no ETF’s are included in the Nasdaq data, the ETF flag is a constant false for these companies. Once it has finished iterating through each line it closes the file and moves on to the nyse.txt file.
The code for the nyse.txt file may seem more complicated as it has a few extra steps, however it essentially serves the same purpose. The added steps are solely because this text file includes ETF’s and contains some securities listed on exchanges other than the NYSE.
Once the nyse.txt file is closed (after the with keyword block), the function returns the tickers dictionary.
4a: Writing a Python Dictionary to JSON String
import jsondef write_to_json(data):
file_location = 'tickerlist/tickers.json'
json_string = json.dumps(data)
with open(file_location,'w+') as file:
file.write(json_string)
This one is short and sweet. The write_to_json(data) function takes as input the python dictionary we create with merge_tickers() and writes this JSON data to a .json file. The .json file will be the one which is specified by the file_location variable. Note, you have to ‘import json’ at the top of your python file.
4b: Putting ‘ merge_tickers()’ and ‘write_to_json()’ Together
companies = merge_tickers()
write_to_json(companies)
Add the above and run the “tickers.py” script and you should now have ticker data for over 9000 companies listed on the Nasdaq and NYSE!
STEP 5: Extracting the Data
Believe it or not, once you have an array of company tickers to call, collecting the information you need isn’t difficult. This is thanks to the yfinance python library. Let’s create a script called ‘extract.py’ in the project’s main directory and in that file insert the code below.
** Be sure to pip install yfinance **
import yfinance as yf"""
for this demo we are going to focus on end of day (C) pricing for these companies
"""def get_tickers():
file_location = 'tickerlist/tickers.json'
with open(file_location,'r') as file:
string = file.read()
data = json.loads(string)
tickers = [ticker for ticker in data]
return tickersdef initialize_db(tickers):
for ticker in tickers:
try:
data = yf.download(
tickers = ticker,
period='max',
interval='1d',
auto_adjust=True,
prepost=True,
proxy=None )
d = {}
for index, row in data.iterrows():
d[str(index).split(' ')[0]] = {
'O' : row['Open'],
'H' : row['High'],
'L' : row['Low'],
'C' : row['Close'],
'V' : row['Volume'] }
json_string = json.dumps(d)
file_location = 'dailydata/{}.json'.format(ticker)
with open(file_location,'w+') as file:
file.write(json_string)
except:
continue
The get_tickers() method is essentially the inverse of the write_to_json(data) function in the previous section. It reads the file, loads it as a python dictionary, and returns this dictionary.
The initialize_db(tickers) function is more interesting. First, we should discuss how we are storing the data. More experienced programmers will immediately notice we are storing all the information directly as JSON rather than using a traditional database such as SQL. This is because financial data largely gets written once and never gets touched again. In the case of our end of day pricing, this is especially true. Once we have the correct data written for that day, why would we ever have to update it? We only need to be able to read it. Conversely, SQL is more useful for CRUD data. It is overwhelmingly powerful when you have many data points in a system which constantly need to be created, deleted, accessed and/or updated. However, historical data just needs to be accessed, and a structured file system has proven very effective for projects of this nature and scope. For more information on this decision I would check out the stack exchange post linked here.
Now that the way we are structuring the data is clear, let’s go over what this script does. initialize_db(tickers) takes as input the list of tickers from the get_tickers() method, and for each symbol it makes a call using the download() method from the yfinance library. yf.download() then returns a DataFrame of historical end of day prices for the entire public history of that company. Next, the function iterates through all the rows of that DataFrame object and adds to the empty dictionary ‘d’, a dictionary key corresponding to the date and a value, which is a list of items relating to: the open, close, high, low, and volume values corresponding to that day. We then dump that dictionary to a JSON string and write it as a ‘{ticker}.json’ file within the ‘dailydata’ directory. The try and except clauses are there in the case of errors occurring. This is because company data may be formatted incorrectly due to the deprecated API (albeit very infrequently… the Yahoo Finance API is surprisingly robust for a decade of deprecation). In that instance we want to skip to the next ticker. This is generally bad practice and should not be seen as a viable long term solution, however in this instance it is acceptable for achieving a working first prototype.
tickers = get_tickers()
initialize_db(tickers)
Run the above two lines of code and your database will be ready to use! Fair warning, this will take some time!
STEP 6: Visualizing Your Data
This is the step in which you get to see the fruits of your labour. For this section we will be using the python libraries seaborn, pandas, and matplotlib.pyplot to make a basic visualization of your data. Let’s create a python file called ‘visualize.py’ and put the code below.
** Make sure to pip install seaborn, pandas, and matplotlib.pyplot **
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import jsondef plot(ticker):
# retrieve the data
file_location = ‘dailydata/{}.json’.format(ticker)
with open(file_location, ‘r’) as file:
json_string = file.read()
data = json.loads(json_string)
# format the data
try:
stock_prices = [ data[date][‘C’] for date in data ]
dates = [ date for date in data ]
assert len(stock_prices) == len(dates)
except AssertionError:
print(‘Error in retrieving company data’)
return
if len(stock_prices) > 2000:
stock_prices = stock_prices[-2000:]
dates = dates[-2000:]
data = { ‘Dates’ : dates,
‘Price’ : stock_price }
df = pd.DataFrame.from_dict(data)
# change the dates parameter so they are of type date
df[‘Dates’] = pd.to_datetime(df[‘Dates’], format = ‘%Y-%m-%d’)
# plot the data
sns.lineplot(data=df,x=’Dates’,y=’Price’)
plt.title(ticker)
plt.show()
In short, the plot(ticker) method takes in a company symbol, loads that companies historical end of day information from the database as a dictionary, formats that data as two arrays (one of dates and one of their corresponding close prices), then plots that information in a seaborn line plot. If the date and close price lists aren’t the same length it raises an AssertionError and prints there was a problem in data collection. Consequently, the method can’t plot the linechart and returns nothing. If no errors are detected, the function proceeds to check if the data is greater than 2000 data points in length. If that is the case it slices the array to include only the last 2000 points of data. This is because Matplotlib becomes very memory intensive and unresponsive beyond a certain size (however, there’s an opportunity to improve this tool by using a different plotting library). Once the data is formatted, we write this to a dictionary, save it as a pandas DataFrame, and use seaborn to directly plot the DataFrame as a linechart. You can use this function to see the end of day stock price over time of all the companies you tracked in the previous sections! And since it is stored on your local drive, you can access it anywhere, anytime! For example, let’s check out Tesla’s (TSLA) stock price over the last few years.
To do that run:
plot(‘TSLA’)
Which returns:
The nice thing about matplotlib dependent charts, such as the ones seaborn provides, is that they are very responsive out of the box. For example, they support things such as zooming in to see more granular data as well as keeping a history of saved views.
Oh, and here is GameStop’s (GME) stock price overtime… for … obvious reasons.
plot('GME')
In closing, this project is bare-bones, but that’s the point. For beginner projects, you have to start small. The cool thing about this project is that you can take many different avenues to improve it. You can use better visualization software, expand the database through the plenty of other use cases of the yfinance library, or build your own from scratch. In reality, it doesn’t matter what you do to improve this code; what matters is that you do something. At the end of the day, doing something with code is the only way you get better. Just think about how much progress you have made in building this over the last 30 minutes. Now imagine how versatile and useful a tool you could make in just a couple months!