COMP 479 - Final Project

Description:

This project crawls a set of Concordia webpages to determine general sentiment towards certain subjects.

Crawls pages using: https://github.com/yasserg/crawler4j

Parses pages using: https://boilerpipe-web.appspot.com/

Project Structure:

Scraper:

Scraper can be found in src/main/java

It is configured and run by ScraperController

The scraping is done in MyCrawler

The scraper stores HTML pages inside src/html, in folders named after their root pages

The scraper library (crawler4j) stores its own data, which this project doesn't use, in src/data
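As a rough illustration of the folder layout described above, the sketch below derives a folder name from a root page URL and builds the target path under src/html. The class HtmlStorageLayout, the methods folderNameFor and targetFor, the sanitising rule, and the sample URL are all assumptions made for this example; MyCrawler may name its folders differently.

```java
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the project: shows one way to map a
// root page to a folder inside src/html.
class HtmlStorageLayout {

    // Turns a root page URL into a name that is safe to use as a folder name.
    public static String folderNameFor(String rootUrl) {
        URI uri = URI.create(rootUrl);
        String host = uri.getHost() == null ? "" : uri.getHost();
        String path = uri.getPath() == null ? "" : uri.getPath();
        // Replace every run of characters other than letters, digits, '.' and '-' with '_'.
        String safe = (host + path).replaceAll("[^A-Za-z0-9.-]+", "_");
        // Drop leading and trailing underscores left over from slashes.
        safe = safe.replaceAll("^_+|_+$", "");
        return safe.isEmpty() ? "root" : safe;
    }

    // Builds src/html/<folder for root page>/<file name>.
    public static Path targetFor(String rootUrl, String fileName) {
        return Paths.get("src", "html", folderNameFor(rootUrl), fileName);
    }

    public static void main(String[] args) {
        // Sample URL chosen for illustration only.
        System.out.println(targetFor("https://www.concordia.ca/research.html", "page.html"));
    }
}
```

With this rule, folderNameFor("https://www.concordia.ca/research.html") gives www.concordia.ca_research.html, so every page crawled from that root would end up in one folder.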

Indexer:

Indexer can be found in src/main/java

It uses the following folders in src: blocks, index, stats

The classes inside Models were modified to work with plain text instead of XML

The Preprocessor inside IndexBuilder was modified to process HTML files using BoilerPipe
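To give an idea of what happens to the text once it has been extracted from an HTML page, here is a minimal in-memory inverted index. It is only a sketch: the class TinyIndex and its methods are hypothetical, the tokenisation rule is an assumption, and the project's own IndexBuilder (which works with the blocks, index and stats folders) is not reproduced here.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical example class: maps each term to the IDs of the documents
// that contain it.
class TinyIndex {

    // term -> sorted set of document IDs (the postings list)
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    // Tokenises plain text (assumed to be already extracted from HTML)
    // and records the document ID under every term found.
    public void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue; // split can produce an empty leading token
            }
            postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
        }
    }

    // Returns the documents containing the term, or an empty set.
    public SortedSet<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT),
                Collections.emptySortedSet());
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.add(1, "Concordia is great");
        index.add(2, "Great campus, great people");
        System.out.println(index.lookup("great"));
    }
}
```

Sorted postings keep lookups deterministic and make it straightforward to merge partial indexes later, which is one reason a block-based indexer might store them this way.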

IMPORTANT: Before running the Indexer, make sure you have run the scraper to generate the required HTML files
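A quick way to check this requirement before starting the Indexer is sketched below. The class HtmlPrecheck and the method hasHtmlFiles are hypothetical and not part of the project; the only thing taken from this README is that the scraper writes its pages under src/html.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical pre-flight check, not part of the project.
class HtmlPrecheck {

    // Returns true if dir exists and contains at least one .html file at any depth.
    public static boolean hasHtmlFiles(Path dir) {
        if (!Files.isDirectory(dir)) {
            return false;
        }
        try (Stream<Path> files = Files.walk(dir)) {
            return files.anyMatch(p -> Files.isRegularFile(p)
                    && p.getFileName().toString().endsWith(".html"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path htmlRoot = Paths.get("src", "html");
        if (!hasHtmlFiles(htmlRoot)) {
            System.err.println("No HTML files under " + htmlRoot + "; run the scraper first.");
            System.exit(1);
        }
        System.out.println("HTML files found; the Indexer can be run.");
    }
}
```

Run from the project root, this exits with a non-zero status when src/html is missing or empty, so it can be used as a guard in front of the Indexer.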

Set up:

Dependencies

Java 1.8

For development

Fork the project

git clone or download the repo

cd into the project

git remote add main original_project_url (replace original_project_url with the URL of the original repository)

To open in Eclipse:

Open Eclipse

File -> Import -> Maven -> Existing Maven Projects

Choose the cloned/downloaded repo
