Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix partition num when paths contain directories #102

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

kosii
Copy link

@kosii kosii commented Apr 4, 2017

Fixes #96

The current method of calculating the number of partitions didn't take
into account the real size of the files in a directory, it used the
size of a directory, i.e. 4k bytes.
Copy link
Contributor

@drcrallen drcrallen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, can you please add a unit test for this case?

fs.getFileStatus(p).getLen
if (fs.getFileStatus(p).isDirectory) {
var sum = 0l
val children = fs.listFiles(p, false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you scala-fy this? This syntax will be off, but something like:

sum += fs.listFiles(p, false).map(_.getLen).sum

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fs.listFiles returns a RemoteIterator, which isn't a subclass of Iterator and as such, there is no implicit converters towards Scala types. I could do something like Iterator.continually(Try { children.next().getLen }).takeWhile(_.isSuccess).map(_.get).sum, but I'm not sure it's better.

Which one you'd prefer?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer for Hadoop to do things in a way that is compatible with Scala :-P

barring that, please add a comment to why the manual iteration is needed.

@kosii
Copy link
Author

kosii commented Apr 10, 2017

Sure, I'll add unittests ASAP (it might be not too soon though, sorry)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants