Problem

A newspaper editor was researching immigration data trends in H1B (H-1B, H-1B1, E-3) visa application processing over the past years, trying to identify the occupations and states with the largest number of approved H1B visas. She found statistics published by the US Department of Labor and its Office of Foreign Labor Certification Performance Data. But while there are ready-made reports for 2018 and 2017, the site doesn't have them for earlier years.

As a data engineer, you are asked to create a mechanism to analyze data from past years, specifically to calculate two metrics: Top 10 Occupations and Top 10 States for certified visa applications.

Your code should be modular and reusable in the future. If the newspaper gets data for the year 2019 (assuming the data needed to calculate the metrics are available) and puts it in the input directory, running the run.sh script should produce the results in the output folder without any change to the code.

Approach

The repository has the following directory structure:

      ├── README.md
      ├── run.sh
      ├── requirements.txt
      ├── .travis.yml
      ├── src
      │   ├── H1BDataFrame.py
      │   ├── DataTransformer.py
      │   ├── AnalyticsEngine.py
      │   ├── runner.py
      │   └── resources
      │       ├── 2008fileformat.json
      │       └── 2009fileformat.json
      ├── tests
      │   ├── test_dataframe.py
      │   ├── test_analyticsengine.py
      │   ├── test_datatransformer.py
      │   └── context.py
      ├── input
      │   └── h1b_input.csv
      ├── output
      │   ├── top_10_occupations.txt
      │   └── top_10_states.txt
      └── insight_testsuite
          ├── run_tests.sh
          └── tests
              ├── test_1
              │   ├── input
              │   │   └── h1b_input.csv
              │   └── output
              │       ├── top_10_occupations.txt
              │       └── top_10_states.txt
              └── your-own-test_1
                  ├── input
                  │   └── h1b_input.csv
                  └── output
                      ├── top_10_occupations.txt
                      └── top_10_states.txt

Code Pipeline

Data Handling: H1BDataFrame hosts methods for reading CSV files, accessing data, and performing generic operations on the dataframe, similar to pandas.
Data Pre-Processing: DataTransformer currently hosts methods for renaming the raw input column names to a common convention. The latest file structure is used as the standard. Files dated 2009 and earlier use a different naming convention, so the mappings needed to translate the old names to the latest ones are included as JSON files under src/resources.
Analytics: AnalyticsEngine hosts methods for performing basic analysis on the dataset, such as calculating the top 10 statistics.
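
The pre-processing and analytics steps can be illustrated with a minimal sketch. This is not the code in src; it only shows the idea of renaming old column names through a mapping and then counting certified applications per occupation or state. The column names, the `OLD_TO_NEW` mapping, the helper names `rename_columns` and `top_n`, and the alphabetical tie-breaking are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical mapping from an older column convention to the current one.
# The real mappings are the JSON files under src/resources.
OLD_TO_NEW = {
    "STATUS": "CASE_STATUS",
    "LCA_CASE_SOC_NAME": "SOC_NAME",
    "LCA_CASE_WORKLOC1_STATE": "WORKSITE_STATE",
}


def rename_columns(rows, mapping):
    """Return new rows whose keys follow the current naming convention.

    Keys that are not in the mapping are kept unchanged.
    """
    return [{mapping.get(key, key): value for key, value in row.items()}
            for row in rows]


def top_n(rows, column, n=10, status="CERTIFIED"):
    """Count rows with the given case status per value of `column`.

    Returns at most n (value, count) pairs, largest count first.
    Ties are broken alphabetically (an assumption for this sketch).
    """
    counts = Counter(row[column] for row in rows
                     if row.get("CASE_STATUS") == status)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


# Made-up sample rows in the old naming convention.
raw_rows = [
    {"STATUS": "CERTIFIED", "LCA_CASE_SOC_NAME": "SOFTWARE DEVELOPERS",
     "LCA_CASE_WORKLOC1_STATE": "CA"},
    {"STATUS": "CERTIFIED", "LCA_CASE_SOC_NAME": "SOFTWARE DEVELOPERS",
     "LCA_CASE_WORKLOC1_STATE": "WA"},
    {"STATUS": "CERTIFIED", "LCA_CASE_SOC_NAME": "ACCOUNTANTS",
     "LCA_CASE_WORKLOC1_STATE": "CA"},
    {"STATUS": "DENIED", "LCA_CASE_SOC_NAME": "ACCOUNTANTS",
     "LCA_CASE_WORKLOC1_STATE": "TX"},
]

rows = rename_columns(raw_rows, OLD_TO_NEW)
print(top_n(rows, "SOC_NAME"))        # [('SOFTWARE DEVELOPERS', 2), ('ACCOUNTANTS', 1)]
print(top_n(rows, "WORKSITE_STATE"))  # [('CA', 2), ('WA', 1)]
```

Keeping the renaming step separate from the counting step means a new file layout only needs a new mapping, which matches the goal of handling future years without code changes.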

Run Instructions

Running code

Place the input file as ./input/h1b_input.csv and run the run.sh script.

Running unit-tests

Install the dependencies by running pip install -r requirements.txt
Run the pytest command from the root of the project (./H1B-Analytics/).

rvsandeep/H1B-Analytics