
Adding Support for Additional File Types in ExpressionAble


Intro

Currently, ExpressionAble supports working with files in the following formats: CSV, TSV, JSON, Excel, HDF5, Parquet, MsgPack, Stata, Pickle, HTML, SQLite, ARFF, Salmon, Kallisto, Jupyter notebook, RMarkdown, and GCT. This page explains what steps must be taken to expand ExpressionAble to work with other types of files.

Getting Started

If you are unfamiliar with object-oriented programming, classes, and inheritance in Python, your time would be well spent working through a few tutorials before you get going. I personally recommend this tutorial and these interactive exercises to help you out. Since you will need to implement your own classes that inherit from and use preexisting code, this README will make much more sense if you are comfortable with those principles.

To get started, you will need to create your own fork of the ExpressionAble repository. A fork will be your own personal copy of the ExpressionAble code that you will work on, and it remains independent of whatever happens in the original. Click on the "Fork" button as shown below:

Fork button on Github

Then, you will need to clone your fork of the ExpressionAble repository to have access to its files. Press the "Clone or download" button as shown below to display the URL you will use to clone the repository.

Clone button on Github

Copy the URL, then enter the following commands at the command line, replacing <URL> with the URL you just copied from GitHub:

git clone <URL>
cd ExpressionAble

Now, add the original repository as a remote named upstream. This way, you can keep your master branch up to date if things change in the original. Do this with the following command:

git remote add upstream https://github.com/srp33/ExpressionAble.git

Or, if you have previously done this, make sure your fork is up to date by pulling from the original's master branch:

git pull upstream master

Now that you have your own fork set up and up to date, you can create your own branch where you will work on your features. You should never need to work on the master branch. Create a new branch on your copy of the git repository (see below). Replace new-branch-name with the name of the file type you will be implementing:

git checkout -b new-branch-name

Important note on working on multiple file types

If you are going to work on supporting multiple file types, always keep them in separate branches! If you finish working on one file type, make a new branch off of master before starting on another. This makes it much easier to merge your work with the base master branch later.

Now you are ready to use ExpressionAble code and begin work on your file type!

Adding a file type to ExpressionAble

Begin by adding a new file to ExpressionAble/expressionable/files. The file name should be all lowercase and indicate the file type you are supporting.

File structure

For example, if I were creating a class for supporting the ARFF format, I would name my file arfffile.py. All ExpressionAble-supported file types are associated with a class that inherits from EAFile. Make sure the class you are building properly inherits from EAFile. For example, if I were writing a class to support the ARFF file type, my class declaration would look like this:

from ..files import EAFile
class ARFFFile(EAFile):
    ...

If I were implementing support for reading in a file of my chosen type to ExpressionAble, I would at a minimum override and implement this function:

def read_input_to_pandas(self, columnList=[], indexCol=None)

If I were implementing support for writing/exporting data to a file of my chosen type from ExpressionAble, I would at a minimum override and implement this function:

def write_to_file(self, df, gzipResults=False, includeIndex=False, null='NA', indexCol=None, transpose=False)

Function Details

def read_input_to_pandas(self, columnList=[], indexCol=None)

This function must provide a means for reading your desired file type, stored at the location self.filePath, into a Pandas data frame, and it must return a Pandas data frame that contains the information stored in the file. If passed a list of desired columns, columnList, the function should return a data frame containing only those columns from your file. If the list of columns is empty, it should return the entire data set from the file.

Note: the returned data frame should not have an index (besides the default index). If necessary, reset the index using df.reset_index(inplace=True), and do not worry about the parameter indexCol.

The file may be gzipped, which can be checked using self.isGzipped. One way to read in a gzipped file is to temporarily unzip it using EAFile._gunzip_to_temp_file() and then delete the temporary file.

If I were reading from an ARFF file, and I had written a function arffToPandas(filePath) that reads an ARFF file into a Pandas data frame, the code might look like the example below. In your code, you would replace arffToPandas with a function or code that reads your file into a Pandas data frame:

import os
    def read_input_to_pandas(self, columnList=[], indexCol=None):
        if self.isGzipped:
            tempFile = super()._gunzip_to_temp_file()
            # read the unzipped temp file into a data frame using YOUR function
            df = arffToPandas(tempFile.name)
            # delete the temp file
            os.remove(tempFile.name)
        else:
            df = arffToPandas(self.filePath)
        # reduce the data frame to only the requested columns
        if len(columnList) > 0:
            df = df[columnList]
        return df
def write_to_file(self, df, gzipResults=False, includeIndex=False, null='NA', indexCol=None, transpose=False)

This function must provide a means for writing data stored in a Pandas data frame to the location stored at self.filePath. If gzipResults is True, the file created should be gzipped. One way to do this is to write the data to a temporary file and then gzip that file: create a tempfile.NamedTemporaryFile(delete=False), write your data frame to that file's path, close the temporary file, and then use EAFile._gzip_results() to gzip it to your desired file path. If I were writing to a GCT file, the code for this function might look like the example below. In your code, you would replace toGCT() with a function or code you wrote that writes your data frame to your file type:

import tempfile
    def write_to_file(self, df, gzipResults=False, includeIndex=False, null='NA', indexCol=None, transpose=False):
        if gzipResults:
            # write to a temporary file, then gzip it to the destination path
            tempFile = tempfile.NamedTemporaryFile(delete=False)
            toGCT(df, tempFile.name)
            tempFile.close()
            super()._gzip_results(tempFile.name, self.filePath)
        else:
            toGCT(df, self._remove_gz(self.filePath))

includeIndex indicates whether the index column should be written to the file; whether it should will depend on your implementation and on whether you want Pandas' default index stored in your file. null is an optional parameter that indicates how None should be represented in your file. indexCol is the name of the column that should be the index of the data frame. transpose is a boolean indicating whether the data frame was previously transposed from its original state.
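For illustration only, and not part of ExpressionAble's actual code: if your format happened to be writable by one of Pandas' built-in writers, these parameters might map onto a writer call roughly as follows, with to_csv standing in for whatever code writes your format:

    def write_to_file(self, df, gzipResults=False, includeIndex=False, null='NA', indexCol=None, transpose=False):
        # hypothetical sketch: write the data frame, honoring includeIndex and null
        df.to_csv(self.filePath, sep='\t', index=includeIndex, na_rep=null)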

It is worth noting that most Pandas data frames have a default index that numerically labels rows 0, 1, 2, and so on. Unless your specific format requires it, we do not want the default index exported to your file because it was not originally part of the data set. The general exception to this rule seems to be when the data has been transposed; then, the default index should be written. When writing to some file types, it may help to insert the following code at the beginning of the write_to_file function:

if not transpose:
    df = df.set_index(indexCol) if indexCol in df.columns else df

This code replaces the data frame's default index with the column indicated by indexCol, assuming the data was not transposed. Doing this may make it easier to avoid writing the default index to your file. Whether or not you need to do this is dependent on your specific implementation, but it is worth considering.

Connecting Your File Type to ExpressionAble

In addition to implementing the class for your file type, you must hook your file type's class into ExpressionAble so that ExpressionAble can use it. You need to add a clause to EAFile.factory(), a function found in ExpressionAble/expressionable/files/eafile.py, that will be used to construct a file object of your type. The type parameter is a string that corresponds to the name of your file type. If such a string is given, you should return a file object of your type, constructed with the given filePath and type. For this to work, you will also need to add an import statement in the factory method of eafile.py. If I were adding support for a GCT file, my code would look like the example below. You should of course replace 'gct' and GCTFile with your own extension and file class, respectively:

def factory(filePath, type):
    from ..files import GCTFile

    if type.lower() == ...
    ...
    elif type.lower() == 'gct': return GCTFile(filePath, type)

A clause should be added in the EAFile.__determine_extension() function that indicates what file extension or extensions correspond to your file type. This method should return the name of your file type when a file name has the extension related to your file type. The purpose of this function is to enable ExpressionAble to infer file types based on file extensions. If I were adding support for Parquet files, whose file extension is '.pq', my code would look like this:

def __determine_extension(fileName):
    ...
    if extension == ...
    ...
    elif extension == 'pq':
        return 'parquet'

Finally, a small addition must be made to ExpressionAble/expressionable/files/__init__.py. At the top of the file, add the name of the class you wrote to the list titled __all__ as shown below:

__all__ = ['EAFile', 'ARFFFile', 'CSVFile', 'ExcelFile', 'GCTFile', 'HDF5File', 'HTMLFile', 'JSONFile', 'JupyterNBFile',
           'KallistoEstCountsFile', 'KallistoTPMFile', 'MsgPackFile', 'ParquetFile', 'PickleFile', 'RMarkdownFile',
           'SalmonNumReadsFile', 'SalmonTPMFile', 'SQLiteFile', 'StataFile', 'TSVFile', 'FWFFile']

Then, add an import statement that imports the class you wrote. Below is an example of such a statement for importing an ExcelFile:

from expressionable.files.excelfile import ExcelFile

Here is an image of what __init__.py looks like: Image of init.py

Adding new dependencies

If you introduced any new dependencies or libraries to ExpressionAble (anything you import that had to be installed first), you need to add them to ExpressionAble/setup.py. Remember, just because it works on your machine doesn't mean it will work on everyone else's when installed! Adding dependencies to setup.py guarantees that they will be installed along with ExpressionAble. When you open setup.py, it will look like this:

setup.py

On the line that says "install_requires" there is a list of comma-separated strings, each of which is the name of a dependency that ExpressionAble needs. To add your dependency, simply add the name of the library to the list as another quoted string.
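As a sketch only (the real setup.py contains ExpressionAble's actual metadata and dependency list), the relevant part looks roughly like this, where 'your-new-library' is a placeholder for the dependency you introduced:

from setuptools import setup, find_packages

setup(
    name='expressionable',
    packages=find_packages(),
    # the existing dependencies remain in this list; add yours as another quoted string
    install_requires=['pandas', 'your-new-library'],
)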

Adding Necessary Tests

In order to determine whether your file class properly works with ExpressionAble, it will need to pass tests that check whether read_input_to_pandas and write_to_file work properly. If you are only supporting reading your file type into ExpressionAble, only follow the instructions for "Tests for reading files". If you are only supporting exporting to your file type from ExpressionAble, only follow the instructions for "Tests for writing to files". These tests will be run every time code is committed to GitHub to ensure that new code does not break previously-working code.

Tests for reading files

First, create a file of your type that is equivalent to this TSV file. This preferably should be done by hand to ensure accuracy. This file must be named input.tsv, except you should replace the extension tsv with the appropriate extension for your file type.

You will also need to provide a gzipped version of this same file. It must be named gzipped.tsv.gz, except you should replace the middle extension tsv with the appropriate extension for your file type.

Now that you have the files created, they must be placed in the appropriate testing folder. Move your input file into the folder Tests/InputData/InputToRead. Move your gzipped file to Tests/InputData/GzippedInput.
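Before running the full test suite, you may find it helpful to do a quick local sanity check that your reader reproduces the key data. The following is only a sketch: it assumes the hypothetical ARFFFile class from earlier, that your input file is named input.arff, and that the TSV key is available as input.tsv in the same folder:

import pandas as pd
from expressionable.files.arfffile import ARFFFile  # replace with your own class

# the TSV key that your input file was created to match
key = pd.read_csv("Tests/InputData/InputToRead/input.tsv", sep="\t")

# read your file with your new class and compare it to the key
mine = ARFFFile("Tests/InputData/InputToRead/input.arff", "arff").read_input_to_pandas()
print(key.equals(mine))  # ideally True; dtype or column-order differences may need a closer look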

In order for the test script to check your files for accuracy, you must add a small item to the file RunTests.sh. Near the top of the file is a list declaration, extensionsForReading, that lists all the file extensions for file types being tested for reading. Add your file type's extension, in quotes, to the list, as shown below.

Adding extensions to test

Now when the testing suite is run, it will include tests for reading your file and its gzipped version.

Tests for writing to files

If you have not yet done so, create a file of your type that is equivalent to this TSV file. This preferably should be done by hand to ensure accuracy. This file must be named input.tsv, except you should replace the extension tsv with the appropriate extension for your file type.

Place this input file that you have created into the folder Tests/OutputData/WriteToFileKey.
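Optionally, you can exercise write_to_file locally before running the suite. Again, this is only a sketch that assumes the hypothetical ARFFFile class and the TSV key location used above; depending on your implementation you may also need to pass indexCol:

import pandas as pd
from expressionable.files.arfffile import ARFFFile  # replace with your own class

# load the key data and write it out with your new class
key = pd.read_csv("Tests/InputData/InputToRead/input.tsv", sep="\t")
ARFFFile("test_output.arff", "arff").write_to_file(key)
# compare test_output.arff against the key file you placed in Tests/OutputData/WriteToFileKey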

In order for the test script to check your files for accuracy, you must add a small item to the file RunTests.sh. Near the top of the file is a list declaration, extensionsForWriting, that lists all the file extensions for file types being tested for writing. Add your file type's extension, in quotes, to the list, as shown below.

Adding extensions to test

Now when the testing suite is run, it will include tests for writing to your file type.

Running the automated tests

To run the suite of automated tests in a controlled environment, enter the following command into the terminal from the root of the ExpressionAble project:

bash build_docker

The script will alert you if tests fail and why they failed. Note that this testing suite is testing operations across ALL file types, and not just yours. If all the tests pass, you are ready to submit a pull request and officially integrate your code into ExpressionAble!

Submitting a pull request

Add, commit, and push your changes to the branch that you created earlier. Replace message with a brief message that describes the work you have done, and replace new-branch-name with the name of the branch you created previously:

git add --all
git commit -m "message"
git push origin new-branch-name

Go here to create a GitHub pull request. Under "Comparing changes", click the blue text "compare across forks". Ensure the base fork is "srp33/ExpressionAble/" and the base branch is "master". For the head fork, select the fork you have been working on (it is easily identified by your GitHub username). For the compare branch, select the branch that you have been working on and that is up to date with your most recent changes. Leave a comment explaining the work you've done, then click on "Create pull request". We will then check to make sure your code is working properly. If it is, we will integrate your code into the ExpressionAble repository.