Change Log

aut-0.17.0 (2018-10-04)

Full Changelog

Implemented enhancements:

Add EscapeHTML Function for ExtractLinks #266
PySpark support #12

Fixed bugs:

AUT exits/dies on java.util.zip.ZipException: too many length or distance symbols #271
AUT exits/dies on java.util.zip.ZipException: invalid distance too far back #246
Improve ExtractDomain Normalization #239
Twitter analysis is broken; see also: json4s/json4s#496 #197
Prevent encoding errors in PySpark #122

Closed issues:

Cannot skip bad record while reading warc file #267
Why did Scalastyle not reject null values in TweetUtilTest #255
Create UDF to combine basic text filtering features #253
spark-shell --packages "io.archivesunleashed:aut:0.16.0" fails with not_found dependencies #242
CommandLineAppRunner.scala produces output per WARC instead of combined result. #235
Extract images out of images DataFrame and store to disk #232
Before the next release, make sure docker-aut builds on master... or make sure --packages works #227
DataFrames for image analysis #220
The attempt to upgrade Spark version to 2.3.0 is not successful #218
Convert nulls to Option(T) #212
Bringing Scala DataFrames into PySpark #209
What is AUT? #208
Refactor ExtractGraph and assess value of GraphX for producing network graphs #203
Codify creation of standard derivatives into apps #195
TweetUtils - support fulltext #192
Combine UDFs into appropriate objects #187
Register Scala functions for use in Pyspark #148
PySpark performance bottlenecks: counting values #130
Redesign of PySpark DataFrame interface for filtering #120
Improve RecordLoader.scala test coverage #60

Merged pull requests:

Patch for #246 & #271: Fix exception error when processing corrupted ARC files #272 (borislin)
Update Bug report template. #268 (ruebot)
ExtractBoilerpipeText to remove headers as well. #253 #256 (greebie)
Add additional tweet fields to TweetUtils; partially address #194. #254 (ruebot)
Add support for full_text in tweets; resolve #192. #252 (ruebot)
Get rid of 'filesystem-root relative reference' warning. #251 (ruebot)
Remove stray characters from example commands. #250 (ruebot)
Deal with final scalastyle assessments: Issue 212 #249 (greebie)
Address main scalastyle errors - #196 #248 (greebie)
Add ExtractGraphX including algorithms for PageRank and Components. Issue 203 #245 (greebie)
Travis build fixes #244 (ruebot)
Data frame implementation of extractors. Also added cmd arguments to resolve #235 #236 (TitusAn)
Save images from dataframe to disk #234 (JWZ2018)
Add missing dependencies in; addresses #227. #233 (ruebot)
Code cleanup: ArchiveRecord + impl moved into same Scala file #230 (lintool)
Add Extract Image Details API #226 (JWZ2018)
Implement DomainFrequency, DomainGraph and PlainText extractor that can be run from command line #225 (TitusAn)
Remove duplicate call of keepValidPages #224 (JWZ2018)
Extract Image Links DF API + Test #221 (JWZ2018)
Update Apache Spark to 2.3.0; resolves #218 #219 (ruebot)
Resolve archivesunleashed/docker-aut#17 #217 (ruebot)
Create issue templates #216 (ruebot)
Exposing Scala DataFrames in PySpark #214 (lintool)
Update project description; resolves #208. #211 (ruebot)
Initial DataFrames merge #210 (lintool)
Add more instructions on how to use things to the README. #207 (ruebot)
