This repo contains the workflow.factset
R package, a Dockerfile to build an image containing that package and its dependencies, and an Azure ARM template to deploy that image, along with factset_data_loader and a PostgreSQL database.
QUICKSTART: See "Deploy", below.
To deploy main
branch with a databse and running FactSet Data Loader:
az deployment group create \
--resource-group "RMI-SP-PACTA-DEV" \
--template-file azure-deploy.with-db.json \
--parameters azure-deploy.with-db.rmi-pacta.parameters.json
To deploy main
branch Extracting PACTA files from an existing FDS
database:
az deployment group create \
--resource-group "RMI-SP-PACTA-DEV" \
--template-file azure-deploy.extract-only.json \
--parameters azure-deploy.extract-only.rmi-pacta.parameters.json
The workflow.factset
package's purpose is to extract data from a database prepared by the FactSet DataFeed Loader application.
For more information on that application, and a docker container to use is efficiently, please see RMI-PACTA/factset_data_loader.
The primary callable function in the package is export_pacta_files()
, which serves as a wrapper around more-targeted functions which download data from the database in a format expected by pacta.data.preparation
.
export_pacta_files()
then writes those in-memory tables to files.
Databases other than PostgreSQL may work, but are not tested.
The default values to control behavior of export_pacta_files()
(and the related connect_factset_db()
) are controlled by OS Environment variables, but use the standard R argument system if you are including these functions in a flow other than the one implemented in the Docker image in this repo.
DATA_TIMESTAMP
: String. Date to use for share prices, fund holdings, and entity financing information. Exact format flexible, but should be parseable bylubridate::ymd_hms()
ISS_REPORTING_YEAR
: Integer. Year to use for ISS reporting data.DEPLOY_START_TIME
: String, optional (defaultSys.time()
). Provides identifier to distinguish between dataset pulled from database at different times. Suggested format:"%Y%m%dT%H%M%SZ"
(this is the default format used by the ARM template described below).EXPORT_DESTINATION
: Filepath. Directory where exported PACTA data files are saved.export_pacta_files()
will create a new directory underneath this path.LOG_LEVEL
: String, optional (default"INFO"
). Not actually used by any of the functions in this package, but used in docker image to set logging verbosity of the package. Suggested values:"INFO"
,"DEBUG"
,"TRACE"
.PGDATABASE
: String. Database name containing FactSet tables.PGHOST
: Hostname. Server address for FactSet database.PGPASSWORD
: String. Password for connection to DB.PGPORT
: Integer, optional (default5432
). Port on DB server to access PostgreSQL.PGUSER
: String. PostgreSQL username. User must have READ access to tables.UPDATE_DB
:true
/false
, optional (defaultfalse
). Delay main function body until a file ($WORKINGSPACEPATH/done_loader
) exists. Useful as part of ARM template, when this repo's docker image runs simultaneously with the RMI-PACTA/factset_data_loader image, which may take hours to finish preparing the database.WORKINGSPACEPATH
: Filepath. Used when$UPDATE_DB
istrue
.
The Docker image defined in Dockerfile
contains all required dependencies for interacting with PostgreSQL
databases, and have been tested against Azure Flexible Server for PostgreSQL (See ARM Template, below).
Note: Setting the LOG_LEVEL
Environment Variable for the docker container is useful for controlling the verbosity of the logs.
The image builds automatically by GitHub actions, and hosted publicly on ghcr.io
.
See "Packages", to the right.
The main
tag should be used, rather than latest
.
azure-deploy.json
is an Azure ARM Template which by default deploys:
- Azure Flexible Server for PostgreSQL
fds
database on that server- Firewall rule allowing access to the server from Azure (such as the containers)
factset_data_loader
docker image (image fromghcr.io
)workflow.factset
docker image (image fromghcr.io
, this repo)
Alternately, the deploy template will only deploy the docker container from this repo's image if the updateDB
parameter is set to false
(see "Parameters and Variables", below).
- (Local)
az
CLI: available here - Storage Account: A storage account must exist, with File Shares (referenced in the existing ARM template as
"factset-loader"
and"factset-extracted"
).- File Share
factset-loader
: see RMI-PACTA/factset_data_loader for more information on expected structure - File Share
factset-extracted
: No structure expected.export_pacta_files()
will create subdirectories in this file share.
- File Share
- Managed Identity: An Azure managed identity must exist with read/write (contributor) permissions for the fileshares. This is the identity that the containers will run with.
ARM Template Parameters can be set at deploy-time to control the properties of the deployed resources.
The azure-deploy.json
template in this repo makes use of them to pass information to the containers and database.
All parameters must have values, but most have sensible defaults already defined in the template, and the rest have example values defined in azure-deploy.example.parameters.json
.
PGHOSTOverride
: IfupdateDB
isfalse
, then this specifies the value of$PGHOST
environment variable thatworkflow.factset
will connect to.PGPASSWORD
: Database Server PasswordcontainerGroupName
: Label to define the container group name, and by default the DB server name will have this appended with-postgres
. Does not affect behavior of container, but note for later, so resources can be deleted/managed via Azure Portal or throughaz
.dataTimestamp
: Passed to containers as $DATA_TIMESTAMP environment variable.identity
: See "Identity" in "Prerequisites", above.imageTagLoader
: (default:main
) tag used forfactset_data_loader
image fromghcr.io
imageTagWorkflow
: (default:main
) tag used forworkflow.pacta
image fromghcr.io
loaderDBBackup
: (default: 1) Shouldfactset_data_loader
pg_dump the database after updating? Empty string disables backup.loaderDBRestore
: (default: 1) Shouldfactset_data_loader
pg_restore the database before updating? Empty string disables restore.loaderFactsetSerial
: Subscription serial number (provided by FactSet)loaderFactsetUsername
: Subscription username (provided by FactSet)location
: (default: same as Resource Group) What Azure zone should resources be deployed in?restartPolicy
: See Azure Documentationstarttime
: (default:utcNow()
) Provides useful unique identifier to distinguish between datasets.storageAccountKeyRawdata
: Account key to Azure Storage Account (see "Prerequisites").updateDB
:true
orfalse
(default:true
). Iffalse
, does not deploy DB orfactset_data_loader
container, andexport_pacta_files()
does not wait for signal from loader before attempting to extract data. Useful when targeting an existing populated database.
Variables in ARM templates are populated at deploy-time (can be influenced by parameters), but can only be edited in the ARM template itself. Many of the variables used in this template are detailed in the factset_data_loader repo.
Key variables to be aware of:
containerregistry
: Defines the container registry images will be pulled from. If using a private registry, be sure to check the ARM template documentation about registry credentialsdbInstanceType
: Defines the machine specifications for DB serverdbSkuSizeGB
: In testing, even with "AutoGrow" enabled, values less than 256 GB caused out of space errors when running in productionmachineCpuCoresLimit*
: Maximum CPU cores available to each container. Note that currently ACI container groups are limited to a maximum of 4 cores (across all containers).machineCpuCoresRequest*
: Cores allocated to each container. Note that for the Loader container, this also affect the parallelization level of the loader (and overall run time).machineMemoryInGBLimit*
: Maximum memory available to each container. Note that currently ACI container groups are limited to a maximum of 16GB (across all containers).machineMemoryInGBRequest*
: Memory requested by each container. Actual usage may fall below this, and be used by other containers (up to their limit)mountPathExport
: Path in container to export files to
Optional: Create a parameters file (azure-deploy.example.parameters.json
serves as a template) for parameters that do not have a default.
If you do not create this file, then the deploy process will prompt for values.
A parameter file with the values that the RMI-PACTA team uses for extracting data is available at azure-deploy.rmi-pacta.parameters.json
.
# run from repo root
# change this value as needed.
RESOURCEGROUP="RMI-SP-PACTA-DEV"
# To deploy with DB and FDSLoader:
az deployment group create \
--resource-group "$RESOURCEGROUP" \
--template-file azure-deploy.with-db.json \
--parameters azure-deploy.with-db.rmi-pacta.parameters.json
# To deploy without DB and FDSLoader:
az deployment group create \
--resource-group "$RESOURCEGROUP" \
--template-file azure-deploy.extract-only.json \
--parameters azure-deploy.extract-only.rmi-pacta.parameters.json
For security, the RMI-PACTA parameters file makes heavy use of extracting secrets from an Azure Key vault, but an example file that passes parameters "in the clear" is available as azure-deploy.example.parameters.json
Non RMI-PACTA users can define their own parameters (examples in the azure-deploy.example.parameters.json
file) and invoke the appropriate ARM Template
Note that the image supports amd64
platforms.
If you are running on an arm64
machine (Apple Silicon), you may need to change the preferrred build platform with:
export DOCKER_DEFAULT_PLATFORM=linux/amd64
To build the image, you can use the standard mechanism of docker build .
.
To build and test in one step, you can alternately use docker-compose up --build
Partial (manual) local testing is possible via docker-compose
.
Currently get_issue_code_bridge()
is the sole function with necessary testing infrastructure.
Testing Steps:
docker-compose up
# in another terminal:
docker attach workflowfactset-workflow.factset-1 # enters the workflow.factset container
This enters the container, which is running R in an interactive session
#in that container
library(workflow.factset)
conn <- connect_factset_db()
issue_code_bridge <- get_issue_code_bridge(conn = conn)
issue_code_bridge
From here, you can exit R as usual (q()
), and then turn off the database container with:
docker-compose down --volumes
The files exported by workflow.factset::export_pacta_files()
are:
Column Name | Column Type | Example Content | Description |
---|---|---|---|
fsym_id | chr | "XXXXXX-R" | FactSet identifier for security |
date | date | 2022-12-31 | date of balance sheet data |
currency | chr | "USD" | currency for balance sheet data |
ff_mkt_val | dbl | 2000000 | Market Value - based on latest closing price and monthly shares |
ff_debt | dbl | 1000000 | Total debt |
fsym_company_id | chr | "XXXXXX-S" | fsym_id connecting to FactSet Fundamentals dataset |
factset_entity_id | chr | "XXXXXX-E" | FactSet identifier for an entity |
Column Name | Column Type | Example Content | Description |
---|---|---|---|
factset_entity_id | chr | "XXXXXX-E" | FactSet identifier for an entity |
entity_proper_name | chr | "FooBar, Inc." | Entity common name, normalized and in proper case |
iso_country | chr | "US" | 2 letter country code for domicile |
sector_code | chr | "6000" | 4 digit code for FactSet sector classification |
factset_sector_desc | chr | "Miscellaneous" | FactSet description for sector |
industry_code | chr | "6005" | 4 digit code for FactSet industry classification |
factset_industry_desc | chr | "Miscellaneous" | FactSet description for industry |
credit_parent_id | chr | "XXXXXX-E" | FactSet entity ID for credit parent |
ent_entity_affiliates_last_update | chr | "2023-12-21T22:35:27Z" | Timestamp for last update of ent_entity_affiliates table |
Column Name | Column Type | Example Content | Description |
---|---|---|---|
fsym_id | chr | "XXXXXX-S" | FactSet identifier for financial instrument |
isin | chr | "XX0000000001" | ISIN for instrument |
factset_entity_id | chr | XXXXXX-E | FactSet identifier for an entity |
adj_price | dbl | 100.5 | Adjusted Share price |
adj_shares_outstanding | dbl | NA | Adjusted number of shares outstanding |
issue_type | chr | NA | Share Type |
one_adr_eq | dbl | NA | Number of shares equivilent to one ADR |
Column Name | Column Type | Example Content | Description |
---|---|---|---|
factset_fund_id | chr | "FFFFFF-E" | FactSet identifier for fund |
fund_reported_mv | dbl | 100000000 | Total reported Market Value |
holding_isin | chr | "XX0000000002" | ISIN held in fund |
holding_reported_mv | dbl | 100000 | Market value of ISIN held in fund |
report_date | date | 2023-12-31 | report date for holding |
Column Name | Column Type | Example Content | Description |
---|---|---|---|
isin | chr | "XX0000000001" | ISIN |
fsym_id | chr | "XXXXXX-S" | FactSet identifier for financial instrument |
factset_fund_id | chr | "FFFFFF-E" | FactSet identifier for fund |
Column Name | Column Type | Example Content | Description |
---|---|---|---|
factset_entity_id | chr | "XXXXXX-E" | FactSet identifier for an entity |
icc_total_emissions | dbl | 123.4 | Total emissions for entity |
icc_scope_3_emissions | dbl | 123.4 | Scope 3 emissions for entity |