A newspaper editor was researching immigration data trends on H1B (H-1B, H-1B1, E-3) visa application processing over the past years, trying to identify the occupations and states with the largest number of approved H1B visas. She found statistics available from the US Department of Labor and its Office of Foreign Labor Certification Performance Data. But while there are ready-made reports for 2018 and 2017, the site doesn't have them for earlier years.
As a data engineer, you are asked to create a mechanism to analyze past years' data, specifically to calculate two metrics: Top 10 Occupations and Top 10 States for certified visa applications.
Your code should be modular and reusable in the future. If the newspaper gets data for the year 2019 (assuming the data needed to calculate the metrics is available) and puts it in the input directory, running the run.sh script should produce the results in the output folder without needing to change the code.
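The two metrics amount to a filter-and-count over the certified rows. A minimal sketch of that idea follows; it is not the repository's implementation. The column names CASE_STATUS, SOC_NAME and WORKSITE_STATE, the helper top_n_certified, and the sample rows are all assumptions made for illustration.

```python
import csv
import io
from collections import Counter


def top_n_certified(rows, group_field, status_field="CASE_STATUS", n=10):
    """Count certified applications per value of group_field; return the top n."""
    counts = Counter(
        row[group_field]
        for row in rows
        if row[status_field].strip().upper() == "CERTIFIED"
    )
    # Sort by count (descending), then by name, so ties come out in a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Made-up sample data; column names are assumed, not taken from the real files.
sample = io.StringIO(
    "CASE_STATUS,SOC_NAME,WORKSITE_STATE\n"
    "CERTIFIED,SOFTWARE DEVELOPERS,CA\n"
    "CERTIFIED,SOFTWARE DEVELOPERS,WA\n"
    "DENIED,ACCOUNTANTS,CA\n"
    "CERTIFIED,ACCOUNTANTS,CA\n"
)
rows = list(csv.DictReader(sample))

top_occupations = top_n_certified(rows, "SOC_NAME")
top_states = top_n_certified(rows, "WORKSITE_STATE")
print(top_occupations)
print(top_states)
```

Because the grouping column is a parameter, the same function serves both metrics, which is one way to keep the code reusable for later years.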
The repository has the following directory structure:
├── README.md
├── run.sh
├── requirements.txt
├── .travis.yml
├── src
│   ├── H1BDataFrame.py
│   ├── DataTransformer.py
│   ├── AnalyticsEngine.py
│   ├── runner.py
│   └── resources
│       ├── 2008fileformat.json
│       └── 2009fileformat.json
├── tests
│   ├── test_dataframe.py
│   ├── test_analyticsengine.py
│   ├── test_datatransformer.py
│   └── context.py
├── input
│   └── h1b_input.csv
├── output
│   ├── top_10_occupations.txt
│   └── top_10_states.txt
└── insight_testsuite
    ├── run_tests.sh
    └── tests
        ├── test_1
        │   ├── input
        │   │   └── h1b_input.csv
        │   └── output
        │       ├── top_10_occupations.txt
        │       └── top_10_states.txt
        └── your-own-test_1
            ├── input
            │   └── h1b_input.csv
            └── output
                ├── top_10_occupations.txt
                └── top_10_states.txt
- Data Handling: H1BDataFrame hosts methods for reading CSV files, accessing data, and performing generic operations on the dataframe, similar to pandas.
- Data Pre-Processing: DataTransformer currently hosts methods for renaming the raw input column names to a generic convention. The latest file structure is used as the standard. Files dated 2009 and earlier use a different naming convention, so the mappings needed to transform the old convention to the latest one are included as JSON files under src/resources.
- Analytics: AnalyticsEngine hosts methods for performing basic analysis on the dataset, such as calculating the top 10 statistics.
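The column renaming described for DataTransformer can be pictured as a dictionary lookup over the header row. The sketch below assumes the JSON mapping is a flat object of old name to standard name; the actual layout of the files in src/resources may differ. The function rename_columns and the column names shown are hypothetical.

```python
import json


def rename_columns(header, mapping):
    """Return the header with old column names replaced by the standard names.

    Names that do not appear in the mapping are kept unchanged.
    """
    return [mapping.get(name, name) for name in header]


# Hypothetical mapping content, inlined here instead of being read from a file.
mapping_json = '{"Approval_Status": "CASE_STATUS", "State_1": "WORKSITE_STATE"}'
mapping = json.loads(mapping_json)

old_header = ["Approval_Status", "State_1", "EMPLOYER_NAME"]
new_header = rename_columns(old_header, mapping)
print(new_header)
```

Keeping the mapping in data rather than in code means a new year's format would only need a new JSON file, under the assumption that the required columns are present.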
Place the input file at ./input/h1b_input.csv and run the run.sh script.
- Install the dependencies by running pip install -r requirements.txt
- Run the pytest command from the root of the project (./H1B-Analytics/)