AlvaroMarquesAndrade edited this page Jun 4, 2020 · 9 revisions

Welcome to the Butterfree wiki!

This repository is a set of tools that make it easier to build ETL pipelines. Butterfree is used to load data into a Feature Store, so that the data can be served to your machine learning algorithms.

What is going on here

In addition to introducing Butterfree itself, this page defines some concepts that are used throughout this Wiki.

The feature store is where features for machine learning models and pipelines are stored. A feature is an individual property or characteristic of a data sample, such as the height of a person, the area of a house, or an aggregated value such as the average price of the houses a user has viewed within the last day. Finally, a feature set can be thought of as a set of features.
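To make the idea of an aggregated feature concrete, here is a minimal sketch in plain Python. It does not use Butterfree; the `events` rows, the `avg_price_last_day` helper, and the reference time `now` are all made up for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical raw events: each row is one house listing viewed by a user.
events = [
    {"user_id": 1, "price": 300_000.0, "ts": datetime(2020, 6, 4, 9, 0)},
    {"user_id": 1, "price": 500_000.0, "ts": datetime(2020, 6, 4, 15, 0)},
    {"user_id": 1, "price": 900_000.0, "ts": datetime(2020, 6, 1, 12, 0)},  # outside the window
    {"user_id": 2, "price": 250_000.0, "ts": datetime(2020, 6, 4, 11, 0)},
]


def avg_price_last_day(rows, user_id, now):
    """Aggregated feature: mean price of the houses a user viewed in the last day."""
    window_start = now - timedelta(days=1)
    prices = [
        row["price"]
        for row in rows
        if row["user_id"] == user_id and window_start < row["ts"] <= now
    ]
    # No views in the window means the feature has no value for this user.
    return sum(prices) / len(prices) if prices else None


now = datetime(2020, 6, 4, 18, 0)  # assumed reference time

# A feature set: one record of named features per user.
feature_set = {
    user_id: {"avg_price_last_day": avg_price_last_day(events, user_id, now)}
    for user_id in sorted({row["user_id"] for row in events})
}
print(feature_set)
```

In this sketch each user ends up with one record holding a single named feature; a real feature set would usually group several such features under the same key.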

This repository holds all the scripts that extract data from the necessary sources (S3 and Kafka, for instance), transform this raw data into feature sets and, finally, load the results into a feature store.

The scripts are written in Python and use Apache Spark.

Service Architecture

Extract

The extract step is performed with a Source object, which defines the desired data sources. You can learn more about it in the corresponding Wiki page.

Transform

The transform step is defined within a FeatureSet, by explicitly defining the desired transformations. More information about the transformations can be found on the transform Wiki page.

Load

Load is the last step of the ETL process, and it is defined by a Sink object. Please refer to the sink Wiki page to learn more.
