
Reliability vs high throughput #2

Open
fnobilia opened this issue Sep 29, 2016 · 4 comments


@fnobilia (Member) commented Sep 29, 2016

Kafka messages can be consumed according to two different strategies:

  • at most once: messages can be lost due to consumer faults
  • at least once: messages can be processed more than once due to consumer faults

This design choice drives the consumer implementation.

We may implement both approaches and select the more suitable one after testing (see the sketch below).

EDIT @blootsvoets: reversed at most and at least
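
For reference, here is a minimal sketch of how the two strategies map onto commit ordering with the plain kafka-clients consumer API; the broker address, group id and topic name are placeholders, not actual project settings:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DeliverySemantics {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder
        props.put("group.id", "radar-consumers");           // placeholder
        props.put("enable.auto.commit", "false");           // commit offsets manually
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("sensor-data")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));

                // At most once: commit BEFORE processing. A crash during
                // processing means the records are never redelivered (loss).
                consumer.commitSync();

                for (ConsumerRecord<String, String> record : records) {
                    process(record);
                }

                // At least once: move commitSync() here instead. A crash
                // between processing and commit means redelivery (duplicates).
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("key=%s value=%s%n", record.key(), record.value());
    }
}
```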

@blootsvoets (Member) commented:

Since we add timestamps to all messages, I'd opt for at-least-once processing to start: we can always establish uniqueness by the timestamp, whereas it may be harder to retrieve lost data.
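
As an illustration of that idea, a minimal per-key deduplication sketch (class and field names are made up; whether the timestamp comes from the Kafka record or from the message payload is left open):

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;

// With at-least-once delivery, skip any record whose timestamp is not newer
// than the last one seen for its key, filtering out redelivered duplicates.
public class TimestampDeduplicator {
    private final Map<String, Long> lastSeen = new HashMap<>();

    public boolean isNew(ConsumerRecord<String, ?> record) {
        Long previous = lastSeen.get(record.key());
        if (previous != null && record.timestamp() <= previous) {
            return false; // duplicate (or older) record: already processed
        }
        lastSeen.put(record.key(), record.timestamp());
        return true;
    }
}
```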

@fnobilia (Member, Author) commented Sep 29, 2016

Yes, but this approach opens another issue.

We would use consumer groups to maximise throughput. Groups imply that Kafka periodically and automatically reassigns partitions across the consumers inside a single group to balance the workload; each reassignment produces a new generation of the group. After a rebalance, consumer X may manage partitions previously handled by consumer Y, so the last-seen timestamp per key cannot be stored locally (i.e. inside each consumer): it has to be kept in a data structure shared among all group members. A NonBlockingHashMap could solve this issue, as in the sketch below.
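
A sketch of that shared map, using the JCTools port of Cliff Click's NonBlockingHashMap; this assumes all group members run as threads inside one JVM, since a group spread across processes would need an external store instead:

```java
import java.util.concurrent.ConcurrentMap;
import org.jctools.maps.NonBlockingHashMap;

// A single map instance shared by every consumer thread in the group, so a
// rebalance that moves a partition from consumer Y to consumer X does not
// lose the last-seen timestamps of that partition's keys.
public class SharedLastSeen {
    private final ConcurrentMap<String, Long> lastSeen = new NonBlockingHashMap<>();

    /** Returns true if the timestamp is newer than any recorded for the key. */
    public boolean updateIfNewer(String key, long timestamp) {
        while (true) {
            Long previous = lastSeen.putIfAbsent(key, timestamp);
            if (previous == null) {
                return true;  // first record seen for this key
            }
            if (timestamp <= previous) {
                return false; // duplicate or older record
            }
            if (lastSeen.replace(key, previous, timestamp)) {
                return true;  // atomically advanced the last-seen timestamp
            }
            // another thread updated the entry concurrently; retry
        }
    }
}
```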

@afolarin (Member) commented Sep 29, 2016

@fnobilia Presumably this is going to depend on the requirements of the consumer / use case? I think at-least-once is generally recommended, though.

As far as I'm aware, exactly-once consumption can be done, but consumers need to checkpoint locally and transactionally; in practice this overhead also places constraints on throughput. See https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-HowdoIgetexactly-oncemessagingfromKafka?
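
A sketch of that local transactional checkpoint, assuming an embedded SQL store (H2-style MERGE syntax) with hypothetical results and offsets tables: the processed result and the consumed offset are committed in one transaction, and on restart the consumer seeks to the checkpointed offset rather than trusting Kafka's committed offsets:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

// Exactly-once on the consumer side: result and offset become durable
// together, so a record is never applied twice and never lost.
public class TransactionalCheckpoint {
    private final Connection db; // local transactional store, e.g. embedded H2

    public TransactionalCheckpoint(Connection db) throws Exception {
        this.db = db;
        db.setAutoCommit(false);
    }

    public void processAndCheckpoint(ConsumerRecord<String, String> record) throws Exception {
        try (PreparedStatement out = db.prepareStatement(
                     "INSERT INTO results(k, v) VALUES (?, ?)"); // hypothetical table
             PreparedStatement ckpt = db.prepareStatement(
                     "MERGE INTO offsets(topic, part, next_offset) KEY(topic, part) VALUES (?, ?, ?)")) {
            out.setString(1, record.key());
            out.setString(2, record.value());
            out.executeUpdate();
            ckpt.setString(1, record.topic());
            ckpt.setInt(2, record.partition());
            ckpt.setLong(3, record.offset() + 1); // next offset to consume
            ckpt.executeUpdate();
            db.commit();   // result and checkpoint are made durable atomically
        } catch (Exception e) {
            db.rollback(); // neither the result nor the offset is saved
            throw e;
        }
    }

    /** On restart, resume each assigned partition from its checkpoint. */
    public void restore(KafkaConsumer<?, ?> consumer, TopicPartition tp) throws Exception {
        try (PreparedStatement q = db.prepareStatement(
                "SELECT next_offset FROM offsets WHERE topic = ? AND part = ?")) {
            q.setString(1, tp.topic());
            q.setInt(2, tp.partition());
            try (ResultSet rs = q.executeQuery()) {
                if (rs.next()) {
                    consumer.seek(tp, rs.getLong(1));
                }
            }
        }
    }
}
```

The price is a synchronous local transaction per record (or batch), which is exactly the throughput constraint mentioned above.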

Also take a look at Flink checkpointing: http://data-artisans.com/kafka-flink-a-practical-how-to/

@fnobilia (Member, Author) commented:

Unfortunately, exactly-once semantics are still under active investigation (1 and 2).

Apache Flink may be useful in the near future to provide fault tolerance and scalability within the analysis layer.
