Learning Curve: Extracting Information from 10Q filings

The information contains in this post is purely for educational purposes. I am not an R savant, so the code you will see below may not be efficient, but it does the job. Feel free to make it efficient.

Background

One of the ways I use to learn R is to pick a small side project, and break it down into multiple tasks. This post is part of a side project that I am working on, which is to extract balance sheet information of a company using their 10Q filings. I just want to start with one filing and extend it to multiple filings using a for loop. So lets start.

Any company that is traded in US has to file its Quarterly financial data (unaudited), which is publicly available. You can also go for 10K filings, which is filed at the end each year. But I like to see the short term progress of a company more than its yearly progress. You can find the data at the SEC website. For the illustration purpose, lets use Apple Inc. Lets say we want to get Apple’s financial information for the year2018.

Data

To see the 10Q filings, go to SEC website, and type in the company’ name, and navigate to 10Q filings.

Once the appropriate 10Q filing is selected, you will see URL of that filing, copy it. As a side note there is an R package called “Edgar” that you can use to get the filings in HTML format and download it in your local drive.

library(stringr)
library(rvest)
library(XML)
library(gsubfn)
library(data.table)

url <- "https://www.sec.gov/Archives/edgar/data/0000320193/000032019318000100/a10-qq320186302018.htm"

I have seen different companies using different terms in their filings, which is dependent on what kind of company it is. For example, Biotechnology companies use R&D expenses, whereas most Financial companies does not. So, make sure to use correct words if you are extracting a specific variable.

For balance sheet, I picked words “total assets” and “total liabilities”. You have to remember that strings are case sensitive. In some filings you might see “Total Assets”, in other filings “total assets”. I have not found any way to get around this problem. I think it is possible to convert the words in the html files into lower case and compare it to given string. Anyway, we will open each file, look for the words, and grab the first table, and extract all rows “tr” in the table.

#get all lines in the document 
lines <- url %>%
  read_html() %>%
  html_nodes(xpath=
  "//table[contains(., 'Total assets') and contains(., 'Total liabilities')]") %>%
  html_nodes(xpath = 'tr') 

#Creating an empty table
table <- data.frame(variables = character(), values = double(), stringsAsFactors=FALSE)
row_ <- 1 #initialize row number

The above code look for all tables in the document and picks the table taht contains our words. Once the table is identified, it copies all rows into “lines” variable.

#Look at each rows separately to extract the information
for(i in 1:length(lines)){
  # get content of the line
  linecontent <- lines[[i]]%>%
    html_children()%>%
    html_text()%>%
    gsub("\r\n","",.)
  
  # attribute the content to free columns
  if (grepl("[[:alpha:]]", str_squish(linecontent[1]))  & nchar(linecontent[[1]]) < 30) { 
    # Check if the first cell contains any words and that word contains less than 30 letters 
    
    val <- str_extract(
      str_squish(linecontent[-1]), 
      "[[:digit:]]+(,\\d+)*") #If yes, then extract the line
    
    if (all(is.na(val), na.rm = TRUE)) {} 
    #Skipping lines that does not contain any numerical values 
    
    else {
      table[row_, ] <- #Add the extracted lines, and replace symbols like paranthesis. 
        c(str_squish(linecontent[1]), 
          gsubfn(".", list(","="", "("= "-", ")" = ""), 
                 linecontent[-1]
                 [which(!is.na(
                   str_extract(str_squish(linecontent[-1]), 
                                           "[[:digit:]]+(,\\d+)*")))[1]]))
      row_ <- row_ + 1 }
  }
 
}
balance_sheets <-  as.data.frame(na.omit(table), 
                                 stringsAsFactors=FALSE) #store the table
print(balance_sheets)

                       variables values
1      Cash and cash equivalents  31971
2       Accounts receivable, net  14104
3                    Inventories   5936
4   Vendor non-trade receivables  12263
5           Other current assets  12488
6           Total current assets 115761
7       Other non-current assets  22546
8                   Total assets 349197
9               Accounts payable  38489
10              Accrued expenses  25184
11              Deferred revenue   7403
12              Commercial paper  11974
13     Total current liabilities  88548
14 Deferred revenue, non-current   2878
15                Long-term debt  97128
16 Other non-current liabilities  45694
17             Total liabilities 234248
18             Retained earnings  79436
19    Total shareholders’ equity 114949

And, there you go.

Extracting Information from 10Q filings

Background

Data

Corrections