03 Jun 23:47

ruebot

2442336

aut-0.80.0

Documentation

Release Notes

Full Changelog

Closed issues:

Broken link in documentation #476
Improve udfs/package.scala test coverage #473
Remove tabDelimit #471
Remove Extract Entities #469
PEP8 Naming - UDFs, App method names, DataFrame names, and filters. #468
Python UDFs - class or not? #467
Remove ExtractImageDetailsDF.scala #464
github-site-deploy uses password-based authentication, which is being deprecated by GitHub #461
Implement Python versions of Serializable APIs #410
Implement Python versions of App utilities #409
Implement Python versions of Matchbox utilities #408
Improve TupleFormatter.scala test coverage #59
Create tests for NERCombinedJson.scala #53
Create tests for NER3Classifier.scala #52
Create tests for ExtractEntities.scala #48

Merged pull requests:

Remove RDD suffixes on file, class, and object names. #479 (ruebot)
PEP8 Python app method names. #477 (ruebot)
Move Python UDF methods out of their own class. #475 (ruebot)
Add DataFrame udf tests. #474 (ruebot)
Remove tabDelimit. #472 (ruebot)
Remove NER functionality. #470 (ruebot)
Add ExtractPopularImages, WriteGEXF, and WriteGraphML to Python. #466 (ruebot)
Remove ExtractImageDetailsDF; resolves #464. #465 (ruebot)
Implement Scala Matchbox UDFs in Python. #463 (ruebot)
Import clean-up for df package. #462 (ruebot)

05 May 00:19

ruebot

aut-0.70.0

df43ac6

aut-0.70.0

Documentation

Release Notes

Full Changelog

Implemented enhancements:

Update PlainTextExtractor to just extract text #452
Migration of all RDD functionality over to DataFrames #223

Fixed bugs:

DomainFrequencyExtractor should remove WWW prefix #456

Closed issues:

For extractor (spark-submit) job, set Spark app name to be the extractor job name. #458
Remove RDD options from app #449
Add parquet as an app format option #448
Add datathon derivatives to app (binary info, web pages, web graph) #447
Update Java 8 instructions for MacOS #445
Add spark-submit to README #444

Merged pull requests:

[skip travis] README updates #460 (ruebot)
Set spark-submit app name to be "aut - extractorName". #459 (ruebot)
Add RemovePrefixWWWDF to DomainFrequencyExtractor. #457 (ruebot)
Updating Java install instructions for MacOS, resolves #445 #455 (ianmilligan1)
Add option to save to Parquet for app. #454 (ruebot)
Update PlainTextExtractor to output a single column; text. #453 (ruebot)
Add a number of additional app extractors. #451 (ruebot)
Remove RDD option in app; DataFrame only now. #450 (ruebot)
[skip-travis] Add spark-submit option to README; resolves #444. #446 (ruebot)

15 Apr 19:41

ruebot

aut-0.60.0

d0f9761

aut 0.60.0

Documentation

Release Notes

Full Changelog

Implemented enhancements:

Discussion: Restyle UDFs in the context of DataFrames #425
Add alt text column to imageGraph (imageLinks) #420
UDFs that filter on url should also filter on src #418

Fixed bugs:

CommandLineApp DomainGraphExtractor Uses Different Node IDs than WriteGraph #439
DomainGraphExtractor produces different output in RDD vs DF #436
Command line app fails because of missing log4j configuration #433

Closed issues:

Remove GraphXML and ExtractGraphX #442
Use monotonically increasing IDs instead of hashes to produce network identifiers. #440
Add graphml output to DomainGraphExtractor #435
Add webgraph, imagegraph, webpages, etc. to command line app #431
Rename imageLinks to imageGraph #419

Merged pull requests:

Remove GraphX support; resolves #442. #443 (ruebot)
Remove WriteGraph; resolves #439. #441 (ruebot)
Add graphml output to CommandLineApp and DomainGraphExtractor. #438 (ruebot)
Align RDD and DF output for DomainGraphExtractor. #437 (ruebot)
Update log4j configuration to resolve #433. #434 (ruebot)
Add imagegraph, and webgraph to command line app. #432 (ruebot)
Tweak hasDate to handle Seq. #430 (ruebot)
Restyle keep/discard filter UDFs in the context of DataFrames #429 (ruebot)
Update Spark and Hadoop versions. #426 (ruebot)
update for 'src' column #424 (SinghGursimran)
[skip travis] Add pre-print link to README. #423 (ruebot)
Add img alt text to imagegraph(); resolves #420. #422 (ruebot)
Rename imageLinks to imageGraph; resolves #419 #421 (ruebot)
Need --repositories flag with --packages. #417 (ruebot)

06 Feb 01:20

ruebot

aut-0.50.0

4d1dcc9

aut 0.50.0

Documentation

Release Notes

Full Changelog

Implemented enhancements:

Enhance keepValidPages #359
Add discardLanguage filter #352
Add crawl_date to binary DataFrames and imageLinks #413

Fixed bugs:

textFiles does not filter properly #390
DataFrame error with text files: java.net.MalformedURLException: unknown protocol: filedesc #362

Closed issues:

.webpages() additional tokenized columns? #402
Test and documentation inventory #372
Missing doc comments #392
Bug in ArcTest? Why run RemoveHTML? #369
UDF CaMeL cASe consistency issues #368
ExtractDomain or ExtractBaseDomain? #367
Align DataFrame boilerplate in Python and Scala #366
Create a ComputeSHA1 method #363
Discussion: Should we align our Named Entity Recognition output with WANE format? #297
DataFrame discussion: open thread #190

Merged pull requests:

Clean up test descriptions, addresses #372. #416 (ruebot)
Remaining Matchbox implementations for Scala #415 (SinghGursimran)
Add crawl_date to binary DataFrames and imageLinks. #414 (ruebot)
Various DataFrame implementation updates for documentation clean-up; Addresses #372. #406 (ruebot)
Use https for maven repo. #405 (ruebot)
Test clean-up. #404 (ruebot)
Add language detection column to webpages. #403 (ruebot)
DataFrame Implementation - Serializable APIs #401 (SinghGursimran)
Filter blank src/dest out of webgraph. #400 (ruebot)
More df implementations #399 (SinghGursimran)
Scala imports cleanup. #398 (ruebot)
More Serializable APIs for DataFrames #396 (SinghGursimran)
Update ExtractDateRDD test #395 (ruebot)
Add doc comments for webpages and webgraph; resolves #392. #394 (ruebot)
Add additional filters for textFiles; resolves #362. #393 (ruebot)
API implementations for DataFrame #391 (SinghGursimran)
Setup for Serializable APIs on DataFrames #389 (SinghGursimran)
Add and update tests, resolve textFiles bug. #388 (ruebot)
Dataframe matchbox Implementations #387 (SinghGursimran)
Clean-up underscore import, and scalastyle warnings. #386 (ruebot)
Rename pages() to webpages(). #384 (ruebot)
More Data Frame Implementations + Code Refactoring #383 (SinghGursimran)
Extract popular images - Data Frame implementation #382 (SinghGursimran)
Append UDF with RDD or DF. #381 (ruebot)
Matchbox utilities to DataFrames #380 (SinghGursimran)
Rename DF functions to be consistent with Python DF functions. #379 (ruebot)
Converting output of NER Classifier to WANE Format #378 (SinghGursimran)
Finding Hyperlinks within Collection on Pages with Certain Keyword #377 (SinghGursimran)
Update README.md #376 (lintool)
Fix for Issue-368 #374 (SinghGursimran)
[skip travis] update description. see https://github.com/archivesunle… #373 (ruebot)
Various UDF implementation and cleanup for DF #370 (lintool)
Update commons-compress to 1.19; CVE-2019-12402 #365 (ruebot)
Add ComputeSHA1 method; resolves #363. #364 (ruebot)
Align NER output to WANE format #361 (ruebot)
Update keepValidPages to include a filter on 200 OK. #360 (ruebot)
Update to Spark 2.4.4 #358 (ruebot)
[skip travis] Update links #357 (ruebot)
Improve test coverage. #354 (ruebot)
Add discardLanguage filter to RecordLoader. #353 (ruebot)

17 Jan 14:34

ruebot

aut-0.18.1

59b6062

aut 0.18.1

Fix for #407

21 Aug 18:42

ruebot

aut-0.18.0

95e5f03

aut 0.18.0

aut-0.18.0 (2019-08-21)

Full Changelog

Implemented enhancements:

Add method for unknown extensions in binary extractions #343
Use Tika's detected MIME type instead of ArchiveRecord getMimeType? #342
Add filter/keep by http status to RecordLoader class #315
Audio binary object extraction #307
Video binary object extraction #306
Powerpoint binary object extraction #305
Doc binary object extraction #304
Spreadsheet binary object extraction #303
PDF binary object extraction #302
Test aut with Apache Spark 2.4.0 #295
Replace hashing of unique ids with .zipWithUniqueId() #243
Integration of neural network models for image analysis #240
More complete Twitter Ingestion #194
Image Search Functionality #165
feature request: log when loadArchives opens and closes warc files in a dir #156

Fixed bugs:

DataFrame commands throwing java.lang.NullPointerException on example data #320
Class issues when using aut-0.17.0-fatjar.jar #313
Image extraction does not scale with number of WARCs #298
ExtractDomain mistakenly checks source first then url #277
Improve ExtractDomain to Better Isolate Domains #269

Closed issues:

Inconsistency in ArchiveRecord.getContentBytes #334
Rationalize computeHash and ComputeMD5 #333
Test additional Java versions with TravisCI #324
Remove Twitter/tweet analysis #322
Trouble testing s3 connectivity #319
Depfu Error: No dependency files found #309
Strategy to deal with conflict between application and Spark distribution dependencies #308
SaveImageTest.scala should delete saved image file #299
Remove Deprecated ExtractGraph.scala file for next release. #291
DetectLanguage.scala: class LanguageIdentifier in package language is deprecated #286
CVE-2017-7525 -- com.fasterxml.jackson.core:jackson-databind #279
Maven build warning during release #273
Improve DataFrameLoader.scala test coverage #265
Improve package.scala test coverage #263
Discussion: Idiom for loading DataFrames #231
DataFrame field names: open thread #229
DataFrame performance comparison: Scala vs. Python #215
TweetUtilsTest.scala doesn't test Spark, only underlying json4s library #206
feature request: ArchiveRecord.archiveFile #164
feature request: possibility to query about the progress #162
Update to Apache Tika 1.19.1; security vulnerabilities in 1.12 #131
Create tests for ExtractGraph.scala #49
Setup Victims #5

Merged pull requests:

Update LICENSE and license headers. #351 (ruebot)
Add binary extraction DataFrames to PySpark. #350 (ruebot)
Add method for determining binary file extension #349 (jrwiebe)
Add keep and discard by http status. #347 (ruebot)
Add office document binary extraction. #346 (ruebot)
Use version of tika-parsers without a classifier #345 (jrwiebe)
Use Tika's detected MIME type instead of ArchiveRecord getMimeType. #344 (ruebot)
Add Audio & Video binary extraction #341 (ruebot)
Extract PDF #340 (jrwiebe)
More scalastyle work; addresses #196. #339 (ruebot)
Replace computeHash with ComputeMD5; resolves #333. #338 (ruebot)
Update Tika to 1.22; address security alerts. #337 (ruebot)
Tests #336 (ruebot)
Make ArchiveRecord.getContentBytes consistent, Resolve #334 #335 (ianmilligan1)
Enable S3 access #332 (jrwiebe)
Updates to pom following 0e701b2 #328 (ruebot)
Move data frame fields names to snake_case. #327 (ruebot)
Python formatting, and gitignore additions. #326 (ruebot)
Test Java 8 & 11, and remove OracleJDK; resolves #324. #325 (ruebot)
Remove Tweet utils. #323 (ruebot)
Update to Spark 2.4.3 and update Tika to 1.20. #321 (ruebot)
add image analysis w/ tensorflow #318 (h324yang)
Makes ArchiveRecordImpl serializable #316 (jrwiebe)
Resolve cobertura-maven-plugin class issue; resolves #313. #314 (ruebot)
Update spark-core_2.11 to 2.3.1. #312 (ruebot)
Log closing of ARC and WARC files, per #156 #301 (jrwiebe)
Delete saved image file; resolves #299 #300 (jrwiebe)
Remove Deprecated ExtractGraph app #293 (greebie)
Add .getHttpStatus and .getFilename to ArchiveRecordImpl class #198 & #164 #292 (greebie)
Update license headers for #208. #290 (ruebot)
Change Id generation for graphs from using hashes for urls to using .zipWithUniqueIds() #289 (greebie)
CVE-2018-11771 update #288 (ruebot)
CVE-2017-17485 update; follow-on to #281. #287 (ruebot)
Update Apache Tika - security vulnerabilities; resolves #131. #285 (ruebot)
[skip travis]...

04 Oct 21:42

ruebot

aut-0.17.0

694382c

aut 0.17.0

Change Log

aut-0.17.0 (2018-10-04)

Full Changelog

Implemented enhancements:

Add EscapeHTML Function for ExtractLinks #266
PySpark support #12

Fixed bugs:

AUT exits/dies on java.util.zip.ZipException: too many length or distance symbols #271
AUT exits/dies on java.util.zip.ZipException: invalid distance too far back #246
Improve ExtractDomain Normalization #239
Twitter analysis is broken; see also: json4s/json4s#496 #197
Prevent encoding errors in PySpark #122

Closed issues:

Cannot skip bad record while reading warc file #267
Why did Scalastyle not reject null values in TweetUtilTest #255
Create UDF to combine basic text filtering features #253
spark-shell --packages "io.archivesunleashed:aut:0.16.0" fails with not_found dependencies #242
CommandLineAppRunner.scala produces output per WARC instead of combined result. #235
Extract images out of images DataFrame and store to disk #232
Before the next release, make sure docker-aut builds on master... or make sure --packages works #227
DataFrames for image analysis #220
The attempt to upgrade Spark version to 2.3.0 is not successful #218
Convert nulls to Option(T) #212
Bringing Scala DataFrames into PySpark #209
What is AUT? #208
Refactor ExtractGraph and assess value of GraphX for producing network graphs #203
Codify creation of standard derivatives into apps #195
TweetUtils - support fulltext #192
Combine UDFs into appropriate objects #187
Register Scala functions for use in Pyspark #148
PySpark performance bottlenecks: counting values #130
Redesign of PySpark DataFrame interface for filtering #120
Improve RecordLoader.scala test coverage #60

Merged pull requests:

Patch for #246 & #271: Fix exception error when processing corrupted ARC files #272 (borislin)
Update Bug report template. #268 (ruebot)
ExtractBoilerpipeText to remove headers as well. #253 #256 (greebie)
Add additional tweet fields to TweetUtils; partially address #194. #254 (ruebot)
Add support for full_text in tweets; resolve #192. #252 (ruebot)
Get rid of 'filesystem-root relative reference' warning. #251 (ruebot)
Remove stray characters from example commands. #250 (ruebot)
Deal with final scalastyle assessments: Issue 212 #249 (greebie)
Address main scalastyle errors - #196 #248 (greebie)
Add ExtractGraphX including algorithms for PageRank and Components. Issue 203 #245 (greebie)
Travis build fixes #244 (ruebot)
Data frame implementation of extractors. Also added cmd arguments to resolve #235 #236 (TitusAn)
Save images from dataframe to disk #234 (JWZ2018)
Add missing dependencies in; addresses #227. #233 (ruebot)
Code cleanup: ArchiveRecord + impl moved into same Scala file #230 (lintool)
Add Extract Image Details API #226 (JWZ2018)
Implement DomainFrequency, DomainGraph and PlainText extractor that can be run from command line #225 (TitusAn)
Remove duplicate call of keepValidPages #224 (JWZ2018)
Extract Image Links DF API + Test #221 (JWZ2018)
Update Apache Spark to 2.3.0; resolves #218 #219 (ruebot)
Resolve archivesunleashed/docker-aut#17 #217 (ruebot)
Create issue templates #216 (ruebot)
Exposing Scala DataFrames in PySpark #214 (lintool)
Update project description; resolves #208. #211 (ruebot)
Initial DataFrames merge #210 (lintool)
Add more instructions on how to use things to the README. #207 (ruebot)

26 Apr 20:51

ruebot

aut-0.16.0

5d9c515

aut 0.16.0

Full Changelog

Implemented enhancements:

Revisit approach to .keepValidPages() #177

Closed issues:

keepValidPages incorrectly filters out pages with mime-type text/html followed by charset #199

Merged pull requests:

Unbork'ing tweet analysis (Fixes Issue 197) - take2 #205 (lintool)
Update README.md #202 (lintool)
Code reformatting #201 (lintool)
fix #199: mime-type was incorrectly parsed from content-type when cha… #200 (dportabella)

11 Apr 01:16

ruebot

aut-0.15.0

9e874c3

aut 0.15.0

aut-0.15.0 (2018-04-11)

Full Changelog

Implemented enhancements:

Clean-up scaladoc comments #184

Closed issues:

Rename package io.archivesunleashed.io #188
Major Refactoring: RecordRDD #180
Major refactoring: matchbox cleanup #179
Major refactoring: io.archivesunleashed.spark -> io.archivesunleashed #178

Merged pull requests:

Improve and clean-up Scaladocs; resolves #184 #193 (ruebot)
Major refactoring of package structure #189 (lintool)
make ArchiveRecord a trait #186 (helgeho)

This Change Log was automatically generated by github_changelog_generator

20 Mar 19:22

ruebot

aut-0.14.0

a0f5dbf

aut 0.14.0

Full Changelog

Closed issues:

Incorporate Scala UDFs into Auto-documentation #176

Merged pull requests:

Resolve #176; setup scaladocs. #183 (ruebot)
Revert "make ArchiveRecord a trait (#175)" #181 (ruebot)

Releases: archivesunleashed/aut
