Data Extraction from SEC Filings: 10-K, 10-Q, & 8-K Forms

There's a growing need in the financial markets for faster access to the financial information that supports trading decisions. Investors see these financial documents as being crucial for understanding the financial status of companies and also to avoid fraud. It's quite hard to search and go through all the voluminous documents filed with the SEC to obtain the information an investor needs. Therefore to save time, we'll have to rely on intelligent automation systems to extract and store all the required data from these financial documents. In this blog, we'll learn how one can parse or extract information from these SEC forms. Additionally, let’s also look at different techniques that are used for achieving this task that include Optical Character Recognition and Deep Learning.

What are SEC filings?

The U.S. Securities and Exchange Commission (SEC) is an independent federal government administrative agency responsible for protecting investors, maintaining fair and orderly functioning of the securities markets, and facilitating capital formation. To achieve this, it requires all public companies, company insiders, and broker-dealers to file financial statements annually. Based on these documents, investors can review a company's profile and it’s activities. Now, let's see some of the most common forms that companies are required to submit to the SEC.

What are 10K forms & how to read them?

A document filed annually by publicly traded companies with the SEC, the 10-K Form is a detailed summary of a company's financial performance, business strategies, risk factors, and legal proceedings. Common items to review are the financial statements (including an income statement, balance sheet, and cash flow statement), Management’s Discussion and Analysis (MD&A) and related notes to the financial statements.
Below are some of the fields that give a quick overview of a company. It is essential to extract and store such information from these forms.

Company Name
Company Address
Employer Identification Number
Common Stock List
Current Share Values
Products List
Security Companies List
Balance Sheet Tables
Income Sheet Tables
Cash Flow Statements

These are some of the fields that are commonly looked at but there are a lot of others to look at too. All companies must file these 10K forms within 60 to 90 days of the close of their fiscal year. This 10K form seems large, right? That’s why the SEC introduced 10Q forms, which are truncated versions of 10K forms. Let’s learn about them in the next section.

What are 10Q forms & how to read them?

The 10-Q forms provides financial data and company results between the filing of 10-K. Essential sections include unaudited financial statements, MD&A, and disclosures addressing market risks and litigation contingencies.

The 10Q form is a quarterly, unaudited report. They are traditionally the comprehensive report of a company's performance. Usually, companies are required to disclose the 10Q documents with relevant information regarding their financial position. There is no filing after the fourth quarter because that is when the 10-K is filed. Below are some of the essential information that can be extracted from 10Q forms.

Financial Position of Company in Tables
Management Discussions
Working Capital
Amounts Used and Received
Market Risks

To perform information extraction from these forms, one must be aware of both table extraction and key-value pair extraction as there are no specific templates.

What are 8K forms ?

Filed on a current report, The 8K form informs the stakeholders about significant events or corporate changes which could be useful for investors—such as mergers, changes in management, or results from the financial operations of a corporation before the necessary information can be transmitted through normal channels.

The Form 8-K is what a company uses to disclose significant developments that occur between filings of the Form 10-K or Form 10-Q. Reading an 8K should involve reading through the published disclosures and then working to understand how it could potentially impact the company’s business and financial health.

In the next section, let’s dive into how we can extract information from these forms.

How to extract data from SEC forms?

Unlike invoices or receipts, extracting information from SEC forms is quite a challenging task. As every company has its way of representing all this financial information in different tables and key-value pairs, we'll have to make sure that the OCR algorithms we use are more robust and intelligent. Before learning more about these techniques, let's understand what OCR is about.

The OCR techniques are not new, but they have been continuously evolving with time. Out of these, one popular and commonly used OCR engine is Tesseract. It's an open-source python-based software developed by Google. However, even popular tools like Tesseract or Power Automate fail to extract text in some complex scenarios. They blindly extract text from given images without any processing or rules. Hence they require some intelligent algorithms backing them; this is where deep learning comes into the picture.

Here’s an example 10Q form found online. The output shown below is what we would get if we use Tesseract to extract all the tables and important information:

  	$	6,775	  	  	$	8,575	  
Costs and expenses:	  			
Cost of revenues (including stock-based compensation expense of $6 and $49)

  	 	2,452	  	  	 	2,936	  
Research and development (including stock-based compensation expense of $191 and $237)

  	 	818	  	  	 	1,226	  
Sales and marketing (including stock-based compensation expense of $54 and $78)

  	 	607	  	  	 	1,026	  
General and administrative (including stock-based compensation expense of $40 and $68)

This isn’t orderly or usable, right? As discussed, the core job of OCR is to extract all the text from a given document irrespective of template, layout, language, or fonts. But our goal is to pick all the critical information like customer name, form type, and financial details from the SEC forms that aren't handled by the top OCR engines like Tesseract and others. Therefore, we rely on deep learning which is trained on huge datasets and enable the models to learn. Let’s discuss them in the next section.

How to automate data extraction from SEC forms?

OCR and NLP are the best form of technologies to automate the data extraction from the SEC forms. These tools can pull structured and unstructured data from PDFs or HTML filings, extracting pertinent data like financial numbers, dates, notes from events, etc. We all know that the EDGAR database isn’t easy to navigate so trying to pull that information manually is a daunting task and the chances of errors are much higher. However, by using automation, data can be scanned in a fraction of the time.

Using programming APIs, automation tools can crawl through SEC filings in a fraction of the time, searching for specific sections or data points and then extracting relevant information in a structured format such as a spreadsheet. Automatic extraction, when compared to manual extraction, is a time-saver, and bypasses the risk of error from manual data entry.

Available Datasets for SEC Forms

To make the OCR and the deep learning models, one must train them with consistent data sets. Currently, there are no great tools available online that can automatically extract information from any form. Therefore, after we collect datasets, we’ll have to make sure we build a state of the art deep learning model that does this job. First things first, let’s see how we can prepare a dataset.

As this is publicly available data, we can download them company wise or use the available checkpoints present in open-source projects.

The SEC filings index is split into quarterly files since 1993 (1993-QTR1, 1993-QTR2...) and these can be found online here.
We can use the python-Edgar repository to download the SEC forms using the Python scripts.
Several forms are publicly available in this link here.

Once datasets are downloaded, the next step is to use an annotator to annotate all the required information in the SEC forms. Using these annotation files, we can train the deep learning model. Here are links to some of the open-source annotation tools available on Github.

In the next section, let’s look at a few deep learning models we can use for Information Extraction.

Some Popular Deep Learning Architectures and NLP Techniques

There are two ways for information extraction using deep learning, one building algorithms that can learn from images, and the other from the text.

Let's dive into deep learning and understand how these algorithms identify key-value pairs from images or text. Also, especially for SEC forms, it's essential to extract the data in the tables, as most of the information in SEC forms are mentioned in tabular format. Now, let's review some popular deep learning architectures for scanned documents.

LayoutML: LayoutML is an open-source project by Microsoft for document image understanding. The authors propose pre-training techniques and models that are hugely based on NLP. The idea behind LayoutML is to jointly model interactions between text and layout information across the scanned document image. This is beneficial for a large number of real-world document image understanding tasks like information extraction from scanned documents. We can use models like these to train the EDGAR datasets and build intelligent algorithms for key-value extraction tasks.
CUTIE (Learning to Understand Documents with Convolutional Universal Text Information Extractor): In this research, Xiaohui Zhao proposes extracting key information from documents like receipts or invoices and preserving interesting texts to structured data. The heart of this research is the convolutional neural networks that are applied to texts. Here, texts are embedded as features with semantic connotations. This model is trained on 4, 484 labeled receipts and has achieved 90.8%, 77.7% average precision on taxi receipts and entertainment receipts, respectively.
Named Entity Recognition: Named Entity Recognition allows us to evaluate a chunk of text and find out different entities from it - entities that don't just correspond to a category of a token but apply to variable lengths of phrases. The models take into consideration the start and end of every relevant phrase according to the classification categories the model is trained for. Therefore, for SEC documents, we can train NER models to perform key-value pair extraction.
BERTgrid is a popular deep learning-based language model for understanding generic documents and performing key-value pair extraction tasks. This model also utilizes convolutional neural networks based on semantic instance segmentation for running the inference. Overall, the mean accuracy of the selected document header and line items is 65.48%.
In DeepDeSRT, Schreiber et al. presented the end-to-end system for table understanding in document images. The system contains two subsequent models for table detection and structured data extraction in the recognized tables. It outperformed state-of-the-art methods for table detection and structure recognition by achieving F1-measures of 96.77% and 91.44% for table detection and structure recognition, respectively. Models like these can be used to extract values from tables of pay slips exclusively.

Benefits of automating data extraction from SEC forms

Efficiency: Automated extraction can help you quickly obtain data from SEC reports which in turn saves you an enormous amount of time — giving you quicker access to critical information for analysis and you do not have to worry about any backlogs waiting for data.

Accuracy: Automation reduces manual entry that can result in human data input errors, thereby ensuring data consistency and adherence to reporting standards.

Timeliness: Real time or near real time processing access to fresh information from SEC filings can speed up responses to market events or regulatory updates.

Compliance: It makes SEC reporting requirements more comfortable for entities, ensuring that appropriate financial data is submitted on time and in the proper format. This not only majorly solves for the timeliness of the data as mentioned above but also reduces demand for Uncertainty.