- The final project should represent significant original work applying data science techniques to an interesting problem.
- Final projects are individual efforts, but you should talk frequently with your instructor and classmates about them.
- Address a data-related problem in your professional field or a field you're passionate about.
- If you have a strong interest in the subject matter, you'll create a better project that will be a lot more fun for you!
Here's a collection of past projects from GA Data Science students that may help to stimulate your thinking. You're welcome to use public data or private data, though with private data, you'll have to be careful about what you release. Competing in a Kaggle competition (including past competitions) is also a project option, in which case the data will be provided for you.
- March 29: Discuss project ideas with instructional team
- Past student projects
- Public data sources
- Data science competitions: Kaggle, DrivenData, CrowdANALYTIX, TunedIT, InnoCentive
- March 31: Project question and dataset
- Project question by Corinne Fukayama
- Project question by Alex Kapitanskaya
- April 21: First project presentation
- First presentation by Corinne Fukayama
- April 28: Draft paper
- May 10: Peer review
- May 19: Final project presentation and paper
- Final paper by Mike Yea
By March 29, you should talk with a member of the instructional team about your project idea(s). We can help you choose between different ideas, advise you on the appropriate scope for your project, and ensure that your project question can reasonably be answered using the data science tools and techniques taught in the course. (There is nothing you have to turn in for this milestone.)
Create a GitHub repository for your project. It should include a short write-up that answers these questions:
- What is the question you hope to answer?
- What data are you planning to use to answer that question?
- What do you know about the data so far?
- Why did you choose this topic?
- Clearly defined: The question can easily be summarized in a single statement ("Can we predict A based on B?").
- As simple as possible: The question has a narrow focus rather than broad goals.
- Reasonably available data: The question depends on data that is likely to be available in a "meaningful" quantity.
- Reasonable hypothesis: The question examines factors (B) that might actually be predictive of the outcome (A).
You'll be giving a short presentation to the class about the work you have done so far, as well as your plans for the project going forward. Your presentation should use slides (or a similar format). Your slides, code, data, and visualizations should be included in your GitHub repository. Here are some questions that you should address in your presentation:
- What data have you gathered, and how did you gather it?
- Which areas of the data have you cleaned, and which areas still need cleaning?
- What steps have you taken to explore the data?
- What insights have you gained from your exploration?
- Will you be able to answer your question with this data, or do you need to gather more data (or adjust your question)?
- How might you use modeling to answer your question?
- Please submit a link to your repository (with slides) no later than 6pm on Thursday. I'll be copying your slides to my computer before class begins. Please don't Slack your materials to me unless you are having problems with GitHub.
- Everyone will be presenting from my computer, so your slides should be in a format that can be easily read on any computer (PDF, PowerPoint, Google Slides, IPython Notebook).
- You will have exactly 6 minutes to present, followed by 3 minutes of questions.
- Make sure your project question is crystal clear to every person in the room in the first minute.
- Tell your story in an engaging fashion. Come up with a story or example that will help the audience to relate to your topic.
- It is critical that you practice delivering your presentation and time yourself.
- If you find that your presentation runs longer than the allotted 6 minutes, the solution is not to speak more quickly. Instead, focus your presentation around the most interesting aspects of your project.
If it's not practical to include your entire dataset in your GitHub repository, you should link to your data source and provide a sample of the data. (GitHub blocks files larger than 100 MB and recommends keeping repositories under about 1 GB.) If your data is private, you can either include an "anonymized" version of your data or create a private GitHub repository.
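One possible way to prepare a shareable sample is sketched below, using only the Python standard library. The function names, the `customer_id` column, and the salt value are hypothetical, chosen for illustration. Note that replacing identifiers with a salted hash is pseudonymization rather than full anonymization: other columns may still identify individuals, so review the output before publishing it.

```python
import csv
import hashlib
import io
import random


def sample_rows(rows, k, seed=0):
    """Return up to k rows chosen at random, reproducibly for a given seed."""
    rows = list(rows)
    if len(rows) <= k:
        return rows
    return random.Random(seed).sample(rows, k)


def pseudonymize(value, salt):
    """Replace an identifier with the first 12 hex characters of a salted SHA-256 digest."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]


def make_public_sample(src, dst, id_column, k, salt, seed=0):
    """Read CSV text from src and write a pseudonymized random sample to dst.

    Returns the number of data rows written.
    """
    reader = csv.DictReader(src)
    rows = sample_rows(reader, k, seed)
    for row in rows:
        row[id_column] = pseudonymize(row[id_column], salt)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return len(rows)


if __name__ == "__main__":
    # Hypothetical in-memory data; in practice src and dst would be open files.
    src = io.StringIO("customer_id,amount\nalice,10\nbob,20\ncarol,30\n")
    dst = io.StringIO()
    make_public_sample(src, dst, "customer_id", k=2, salt="keep-this-secret")
    print(dst.getvalue())
```

Keep the salt out of the repository; anyone who has it can check whether a guessed identifier matches a hashed value.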
A draft of your project paper is due, along with the data, well-commented code, and visualizations. It should be written with a technical audience in mind. Your paper should include the following components:
- Problem statement and hypothesis
- Description of your data set and how it was obtained
- Description of any pre-processing steps you took
- What you learned from exploring the data, including visualizations
- How you chose which features to use in your analysis
- Details of your modeling process, including how you selected your models and validated them
- Your challenges and successes
- Possible extensions or business applications of your project
- Conclusions and key learnings
Your peers and instructional team will be providing feedback. However, the paper should stand on its own and should not depend upon the reader remembering your first presentation. The easier your paper is to follow, the more useful feedback you will receive! Also, if your reviewers can actually run your code on the provided data, they will be able to give you better feedback on your code.
You will provide project feedback to two of your peers, according to the peer review guidelines.
Your project repository on GitHub should contain the following:
- Project paper: any format (PDF, Markdown, etc.)
- Presentation slides: any format except for Keynote (PDF, PowerPoint, Google Slides, IPython Notebook, etc.)
- Code: commented Python scripts, and any other code you used in the project
- Visualizations: integrated into your paper and/or slides
- Data: data files in "raw" or "processed" format
- Data dictionary (aka "code book"): description of each variable, including units
- Please submit a link to your repository (with slides) by 6pm on the day you are presenting.
- Regardless of which day you are presenting, your repository should also contain the other required project components by 6pm on the last day of class.
- You will have exactly 20 minutes to present, followed by 10 minutes of questions. Practice your presentation and time yourself!
- Your presentation should start with a recap of the key information from the previous presentation (including your project question), but you should spend the majority of your presentation discussing what has happened since then.
- If your presentation is too long, focus it around the most interesting aspects of your project, rather than trying to include every last detail.
- Tell your story in an engaging fashion.
- You are welcome to invite your friends and family members to attend.
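For the data dictionary listed among the required repository contents, one way to get started is to generate a skeleton from your data file and then fill in the units and descriptions by hand. The sketch below is only an illustration using the Python standard library; the function names and the sample columns are hypothetical, and the type guesses are rough (a column is labeled by whether every non-empty value parses as an integer or a float).

```python
import csv
import io


def infer_type(values):
    """Guess 'integer', 'float', or 'text' from a column's non-empty strings."""
    non_empty = [v for v in values if v.strip() != ""]
    if not non_empty:
        return "unknown"
    for label, cast in (("integer", int), ("float", float)):
        try:
            for v in non_empty:
                cast(v)
            return label
        except ValueError:
            continue
    return "text"


def data_dictionary_skeleton(src):
    """Build a Markdown table with one row per column of the CSV read from src."""
    reader = csv.DictReader(src)
    rows = list(reader)
    lines = [
        "| Variable | Type | Units | Description |",
        "| --- | --- | --- | --- |",
    ]
    for name in reader.fieldnames or []:
        col_type = infer_type([row[name] or "" for row in rows])
        # Units and descriptions cannot be inferred; they are left as TODO.
        lines.append(f"| {name} | {col_type} | TODO | TODO |")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical in-memory data; in practice src would be an open CSV file.
    src = io.StringIO("age,height_cm,city\n31,172.5,Austin\n45,,Boston\n")
    print(data_dictionary_skeleton(src))
```

The generated table is only a starting point: the units and the meaning of each variable still have to come from you.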