aut 0.50.0
Release Notes
Implemented enhancements:
- Enhance keepValidPages #359
- Add discardLanguage filter #352
- Add crawl_date to binary DataFrames and imageLinks #413
Fixed bugs:
- textFiles does not filter properly #390
- DataFrame error with text files: java.net.MalformedURLException: unknown protocol: filedesc #362
Closed issues:
- .webpages() additional tokenized columns? #402
- Test and documentation inventory #372
- Missing doc comments #392
- Bug in ArcTest? Why run RemoveHTML? #369
- UDF CaMeL cASe consistency issues #368
- ExtractDomain or ExtractBaseDomain? #367
- Align DataFrame boilerplate in Python and Scala #366
- Create a ComputeSHA1 method #363
- Discussion: Should we align our Named Entity Recognition output with WANE format? #297
- DataFrame discussion: open thread #190
Merged pull requests:
- Clean up test descriptions, addresses #372. #416 (ruebot)
- Remaining Matchbox implementations for Scala #415 (SinghGursimran)
- Add crawl_date to binary DataFrames and imageLinks. #414 (ruebot)
- Various DataFrame implementation updates for documentation clean-up; Addresses #372. #406 (ruebot)
- Use https for maven repo. #405 (ruebot)
- Test clean-up. #404 (ruebot)
- Add language detection column to webpages. #403 (ruebot)
- DataFrame Implementation - Serializable APIs #401 (SinghGursimran)
- Filter blank src/dest out of webgraph. #400 (ruebot)
- More df implementations #399 (SinghGursimran)
- Scala imports cleanup. #398 (ruebot)
- More Serializable APIs for DataFrames #396 (SinghGursimran)
- Update ExtractDateRDD test #395 (ruebot)
- Add doc comments for webpages and webgraph; resolves #392. #394 (ruebot)
- Add additional filters for textFiles; resolves #362. #393 (ruebot)
- API implementations for DataFrame #391 (SinghGursimran)
- Setup for Serializable APIs on DataFrames #389 (SinghGursimran)
- Add and update tests, resolve textFiles bug. #388 (ruebot)
- Dataframe matchbox Implementations #387 (SinghGursimran)
- Clean-up underscore import, and scalastyle warnings. #386 (ruebot)
- Rename pages() to webpages(). #384 (ruebot)
- More Data Frame Implementations + Code Refactoring #383 (SinghGursimran)
- Extract popular images - Data Frame implementation #382 (SinghGursimran)
- Append UDF with RDD or DF. #381 (ruebot)
- Matchbox utilities to DataFrames #380 (SinghGursimran)
- Rename DF functions to be consistent with Python DF functions. #379 (ruebot)
- Converting output of NER Classifier to WANE Format #378 (SinghGursimran)
- Finding Hyperlinks within Collection on Pages with Certain Keyword #377 (SinghGursimran)
- Update README.md #376 (lintool)
- Fix for Issue-368 #374 (SinghGursimran)
- [skip travis] update description. see https://github.com/archivesunle… #373 (ruebot)
- Various UDF implementation and cleanup for DF #370 (lintool)
- Update commons-compress to 1.19; CVE-2019-12402 #365 (ruebot)
- Add ComputeSHA1 method; resolves #363. #364 (ruebot)
- Align NER output to WANE format #361 (ruebot)
- Update keepValidPages to include a filter on 200 OK. #360 (ruebot)
- Update to Spark 2.4.4 #358 (ruebot)
- [skip travis] Update links #357 (ruebot)
- Improve test coverage. #354 (ruebot)
- Add discardLanguage filter to RecordLoader. #353 (ruebot)