aut 0.17.0
Change Log
aut-0.17.0 (2018-10-04)
Implemented enhancements:
Fixed bugs:
- AUT exits/dies on java.util.zip.ZipException: too many length or distance symbols #271
- AUT exits/dies on java.util.zip.ZipException: invalid distance too far back #246
- Improve ExtractDomain Normalization #239
- Twitter analysis is broken; see also: json4s/json4s#496 #197
- Prevent encoding errors in PySpark #122
Closed issues:
- Cannot skip bad record while reading warc file #267
- Why did Scalastyle not reject null values in TweetUtilTest #255
- Create UDF to combine basic text filtering features #253
- spark-shell --packages "io.archivesunleashed:aut:0.16.0" fails with not_found dependencies #242
- CommandLineAppRunner.scala produces output per WARC instead of combined result. #235
- Extract images out of images DataFrame and store to disk #232
- Before the next release, make sure docker-aut builds on master... or make sure --packages works #227
- DataFrames for image analysis #220
- The attempt to upgrade Spark version to 2.3.0 is not successful #218
- Convert nulls to Option(T) #212
- Bringing Scala DataFrames into PySpark #209
- What is AUT? #208
- Refactor ExtractGraph and assess value of GraphX for producing network graphs #203
- Codify creation of standard derivatives into apps #195
- TweetUtils - support fulltext #192
- Combine UDFs into appropriate objects #187
- Register Scala functions for use in Pyspark #148
- PySpark performance bottlenecks: counting values #130
- Redesign of PySpark DataFrame interface for filtering #120
- Improve RecordLoader.scala test coverage #60
Merged pull requests:
- Patch for #246 & #271: Fix exception error when processing corrupted ARC files #272 (borislin)
- Update Bug report template. #268 (ruebot)
- ExtractBoilerpipeText to remove headers as well. #253 #256 (greebie)
- Add additional tweet fields to TweetUtils; partially address #194. #254 (ruebot)
- Add support for full_text in tweets; resolve #192. #252 (ruebot)
- Get rid of 'filesystem-root relative reference' warning. #251 (ruebot)
- Remove stray characters from example commands. #250 (ruebot)
- Deal with final scalastyle assessments: Issue 212 #249 (greebie)
- Address main scalastyle errors - #196 #248 (greebie)
- Add ExtractGraphX including algorithms for PageRank and Components. Issue 203 #245 (greebie)
- Travis build fixes #244 (ruebot)
- Data frame implementation of extractors. Also added cmd arguments to resolve #235 #236 (TitusAn)
- Save images from dataframe to disk #234 (JWZ2018)
- Add missing dependencies in; addresses #227. #233 (ruebot)
- Code cleanup: ArchiveRecord + impl moved into same Scala file #230 (lintool)
- Add Extract Image Details API #226 (JWZ2018)
- Implement DomainFrequency, DomainGraph and PlainText extractor that can be run from command line #225 (TitusAn)
- Remove duplicate call of keepValidPages #224 (JWZ2018)
- Extract Image Links DF API + Test #221 (JWZ2018)
- Update Apache Spark to 2.3.0; resolves #218 #219 (ruebot)
- Resolve archivesunleashed/docker-aut#17 #217 (ruebot)
- Create issue templates #216 (ruebot)
- Exposing Scala DataFrames in PySpark #214 (lintool)
- Update project description; resolves #208. #211 (ruebot)
- Initial DataFrames merge #210 (lintool)
- Add more instructions on how to use things to the README. #207 (ruebot)