This repo extracts mentions of each U.S. state from a company's 10-K filings for a specific year. For instance, if Target's 2020 10-K mentions "California", the goal is to count how many times "California" appears, and likewise for the other 49 states. The analysis is performed for a sample of firms.
- 10_k_data_mining.py: The main script that handles downloading 10-K filings, cleaning the HTML content, and counting state mentions.
- state_mentions_counts.csv: The output file that contains the count of state mentions for each CIK.
- Python 3.x
- sec_edgar_downloader
- pandas
- beautifulsoup4
- lxml
- html5lib
Install the required packages using:
pip install sec_edgar_downloader pandas beautifulsoup4 lxml html5lib
The script first sets up a directory to store the downloaded 10-K files and then uses sec_edgar_downloader to fetch the 10-K filings for the CIK codes and years listed in an input Excel file.
import os
import pandas as pd
from sec_edgar_downloader import Downloader

# Directory to store the downloaded filings
download_dir = '../Temp'
os.makedirs(download_dir, exist_ok=True)

# SEC EDGAR requires a declared requester identity; sec_edgar_downloader's
# Downloader takes (company name, email address, download folder)
email_address = "[email protected]"
dl = Downloader("CompanyName", email_address, download_dir)  # replace "CompanyName" with your name/organization

# Read the Excel file of CIK codes and years
file_path = 'file path'  # Replace with the path to your Excel file
df = pd.read_excel(file_path)
cik_years = df[['cik_x', 'year']]

for index, row in cik_years.iterrows():
    try:
        cik_number = str(row['cik_x'])
        year = int(row['year'])
        dl.get("10-K", cik_number, after=f"{year}-01-01", before=f"{year}-12-31")
        print(f"Downloaded 10-K for CIK: {cik_number}, Year: {year}")
    except Exception as e:
        print(f"Failed to download 10-K for CIK: {row['cik_x']}, Year: {row['year']}. Error: {e}")

print("Download completed.")
After downloading the files, the script cleans the HTML content to remove unwanted tags and tables.
from bs4 import BeautifulSoup

def clean_html_content(html_content):
    # Parse the HTML content, falling back to more lenient parsers on failure
    try:
        soup = BeautifulSoup(html_content, "html.parser")
    except Exception as e:
        print(f"html.parser failed: {e}")
        try:
            soup = BeautifulSoup(html_content, "lxml")
        except Exception as e:
            print(f"lxml failed: {e}")
            try:
                soup = BeautifulSoup(html_content, "html5lib")
            except Exception as e:
                print(f"html5lib failed: {e}")
                raise
    ...  # cleaning steps elided: remove unwanted tags and tables, extract text
    return cleaned_text
import os
import glob

def process_files(base_directory):
    # Find every downloaded filing and clean it in place
    file_paths = glob.glob(os.path.join(base_directory, '**', '*.txt'), recursive=True)
    for file_path in file_paths:
        with open(file_path, 'r', encoding='utf-8') as file:
            content = file.read()
        cleaned_text = clean_html_content(content)
        with open(file_path, 'w', encoding='utf-8') as file:
            file.write(cleaned_text)
        print(f"Processed and cleaned: {file_path}")

base_directory = 'sec-edgar-filings'  # the folder sec_edgar_downloader creates under the download directory
process_files(base_directory)
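The cleaning steps themselves are elided above. As a rough sketch of what such a body might look like, given the stated goal of removing unwanted tags and tables (an illustration, not the script's exact implementation):

from bs4 import BeautifulSoup

# Illustrative sketch only: drop tables and non-content tags, then flatten to text.
def clean_html_sketch(html_content):
    soup = BeautifulSoup(html_content, "html.parser")
    for tag in soup.find_all(['table', 'script', 'style']):
        tag.decompose()  # remove the tag and everything inside it
    return soup.get_text(separator=' ', strip=True)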
The script counts mentions of each U.S. state in the cleaned 10-K files and compiles the results into a CSV file.
import os
import re
import glob
import pandas as pd

us_states = [
    "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado",
    ...  # remaining 44 states
]

def count_state_mentions(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
    # Whole-word, case-insensitive count for each state name
    state_counts = {state: len(re.findall(rf'\b{state}\b', text, re.IGNORECASE)) for state in us_states}
    return state_counts

def process_files_for_state_mentions(base_directory):
    file_paths = glob.glob(os.path.join(base_directory, '**', 'full-submission.txt'), recursive=True)
    all_counts = []
    for file_path in file_paths:
        # sec_edgar_downloader saves filings as <base>/<CIK>/10-K/<accession>/full-submission.txt,
        # so the CIK directory is three levels above the file
        cik_code = os.path.basename(os.path.dirname(os.path.dirname(os.path.dirname(file_path))))
        state_counts = count_state_mentions(file_path)
        state_counts['CIK'] = cik_code
        all_counts.append(state_counts)
    df = pd.DataFrame(all_counts)
    columns_order = ['CIK'] + us_states
    df = df[columns_order]
    return df

base_directory = '/content/sec-edgar-filings'  # adjust to where the cleaned filings live
state_mentions_df = process_files_for_state_mentions(base_directory)
output_file = 'state_mentions_counts.csv'
state_mentions_df.to_csv(output_file, index=False)
print(f"State mentions counts have been saved to {output_file}")
- Ensure that the paths specified in the script are correctly set to where your files are located.
- The script reads an Excel file containing CIK codes and years to fetch the corresponding 10-K filings.
- The output CSV file state_mentions_counts.csv contains the count of mentions for each state across the downloaded 10-K filings.
For any issues or inquiries, please contact [email protected].