In this post, I explain how to extract financial information (statement of balance sheet) of Apple Inc. company from its third quarter 10Q filing for the year 2018.
The information contains in this post is purely for educational purposes. I am not an R savant, so the code you will see below may not be efficient, but it does the job. Feel free to make it efficient.
One of the ways I use to learn R is to pick a small side project, and break it down into multiple tasks. This post is part of a side project that I am working on, which is to extract balance sheet information of a company using their 10Q filings. I just want to start with one filing and extend it to multiple filings using a for loop. So lets start.
Any company that is traded in US has to file its Quarterly financial data (unaudited), which is publicly available. You can also go for 10K filings, which is filed at the end each year. But I like to see the short term progress of a company more than its yearly progress. You can find the data at the SEC website. For the illustration purpose, lets use Apple Inc. Lets say we want to get Apple’s financial information for the year2018.
To see the 10Q filings, go to SEC website, and type in the company’ name, and navigate to 10Q filings.
Once the appropriate 10Q filing is selected, you will see URL of that filing, copy it. As a side note there is an R package called “Edgar” that you can use to get the filings in HTML format and download it in your local drive.
I have seen different companies using different terms in their filings, which is dependent on what kind of company it is. For example, Biotechnology companies use R&D expenses, whereas most Financial companies does not. So, make sure to use correct words if you are extracting a specific variable.
For balance sheet, I picked words “total assets” and “total liabilities”. You have to remember that strings are case sensitive. In some filings you might see “Total Assets”, in other filings “total assets”. I have not found any way to get around this problem. I think it is possible to convert the words in the html files into lower case and compare it to given string. Anyway, we will open each file, look for the words, and grab the first table, and extract all rows “tr” in the table.
#get all lines in the document
lines <- url %>%
read_html() %>%
html_nodes(xpath=
"//table[contains(., 'Total assets') and contains(., 'Total liabilities')]") %>%
html_nodes(xpath = 'tr')
#Creating an empty table
table <- data.frame(variables = character(), values = double(), stringsAsFactors=FALSE)
row_ <- 1 #initialize row number
The above code look for all tables in the document and picks the table taht contains our words. Once the table is identified, it copies all rows into “lines” variable.
#Look at each rows separately to extract the information
for(i in 1:length(lines)){
# get content of the line
linecontent <- lines[[i]]%>%
html_children()%>%
html_text()%>%
gsub("\r\n","",.)
# attribute the content to free columns
if (grepl("[[:alpha:]]", str_squish(linecontent[1])) & nchar(linecontent[[1]]) < 30) {
# Check if the first cell contains any words and that word contains less than 30 letters
val <- str_extract(
str_squish(linecontent[-1]),
"[[:digit:]]+(,\\d+)*") #If yes, then extract the line
if (all(is.na(val), na.rm = TRUE)) {}
#Skipping lines that does not contain any numerical values
else {
table[row_, ] <- #Add the extracted lines, and replace symbols like paranthesis.
c(str_squish(linecontent[1]),
gsubfn(".", list(","="", "("= "-", ")" = ""),
linecontent[-1]
[which(!is.na(
str_extract(str_squish(linecontent[-1]),
"[[:digit:]]+(,\\d+)*")))[1]]))
row_ <- row_ + 1 }
}
}
balance_sheets <- as.data.frame(na.omit(table),
stringsAsFactors=FALSE) #store the table
print(balance_sheets)
variables values
1 Cash and cash equivalents 31971
2 Accounts receivable, net 14104
3 Inventories 5936
4 Vendor non-trade receivables 12263
5 Other current assets 12488
6 Total current assets 115761
7 Other non-current assets 22546
8 Total assets 349197
9 Accounts payable 38489
10 Accrued expenses 25184
11 Deferred revenue 7403
12 Commercial paper 11974
13 Total current liabilities 88548
14 Deferred revenue, non-current 2878
15 Long-term debt 97128
16 Other non-current liabilities 45694
17 Total liabilities 234248
18 Retained earnings 79436
19 Total shareholders’ equity 114949
And, there you go.
If you see mistakes or want to suggest changes, please create an issue on the source repository.