-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft of Validator Filters #46
base: develop
Are you sure you want to change the base?
Conversation
…Unipd/OTRI into refactor/filtering-pattern
…Unipd/OTRI into refactor/filtering-pattern
…nipd/OTRI into feat/validation-refactor
+ Minor fixes to specs and comments in Filter
- removed old ValidatorFilter + Rewritten as abstract class
+ draft for subclasses of ValidatorFilter
+ moved make_check_date_between to a separate module
+ Implementation of MonoValidator + Minor spec fixes
…t/validation-refactor
+ Reworked Validators as filters that append errors to atoms. feat(exceptions.py): + Added base classes for errors or warnings in atoms' content. feat(valchecks.py): ~ Reduced method verbosity. feat(validation_test.py): + Remade tests for new MonoValidator.
+ Added some exceptions feat (validation.py): + Minor refactor
feat (valchecks.py): Added some checks, those that can be performed by a single function feat (validation.py): Added ContinuityValidator to operate on a Stream's continuity
feat (valchecks.py): Added some checks, those that can be performed by a single function feat (validation.py): Added ContinuityValidator to operate on a Stream's continuity
Added ParallelFilter, waits for all open inputs to have an atom and pops them all at once. feat (filter_test.py): Parameterized Tests to allow running them on ParallelFilter too feat (utils): + __init__.py: Added method listing all modules in the main package.
…e the keys and values that threw the errors
…osed non-empty streams
+ Added constructor to have check patameter feat(validation_test): + Added tests for ParallelValidator
Here is an overview of what got changed by this pull request: Issues
======
- Added 36
Complexity increasing per file
==============================
- otri/validation/validators/coverage_validator.py 7
- otri/analysis/find_clusters.py 5
- test/validation/validators/coverage_validator_test.py 2
- test/validation/validators/neighbor_validator_test.py 9
- otri/validation/valchecks.py 6
- test/validation/__init__.py 3
- test/validation/valchecks_test.py 3
- test/utils/cartesian_hashtable_test.py 6
- misc/dict_profile.py 2
- misc/profile_cartesian.py 2
- otri/analysis/find_null.py 1
- otri/validation/exceptions.py 3
- misc/profile_neighbors.py 3
- otri/validation/validators/cluster_validator.py 3
- test/validation/validation_test.py 4
- test/filtering/filter_test.py 1
- otri/filtering/filter.py 3
- otri/utils/cartesian_hashtable.py 11
- test/validation/validators/discrepancy_validator_test.py 9
- otri/filtering/filters/sieve_filter.py 2
- otri/validation/validators/discrepancy_validator.py 4
- otri/validation/validators/continuity_validator.py 4
- test/validation/validators/continuity_validator_test.py 3
- single_stream_validation.py 5
- otri/validation/validators/neighbor_validator.py 5
- test/validation/validators/cluster_validator_test.py 3
- otri/analysis/find_negatives.py 1
Clones added
============
- otri/analysis/find_clusters.py 1
- test/validation/validators/neighbor_validator_test.py 1
- autocorrelation.py 1
- test/utils/cartesian_hashtable_test.py 2
- misc/dict_profile.py 3
- misc/profile_cartesian.py 3
- otri/analysis/find_null.py 4
- misc/profile_neighbors.py 2
- test/validation/validation_test.py 31
- test/filtering/filter_test.py 10
- test/validation/validators/discrepancy_validator_test.py 1
- single_stream_validation.py 1
- test/validation/validators/cluster_validator_test.py 1
- otri/analysis/find_negatives.py 4
See the complete overview on Codacy |
'''The index at which zero is situated on the axes.''' | ||
|
||
self._table: Iterable[List[T]] = None | ||
'''Table containing the buckets.''' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codacy found an issue: String statement has no effect
''' | ||
self.table = CartesianHashTable(cartesian_tuple) | ||
self.table.add((10, 10, 10)) | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codacy found an issue: Unnecessary pass statement
'''Table containing the buckets.''' | ||
|
||
self._initialized_buckets: List[Tuple[int]] = list() | ||
'''List containing the indexes of the buckets that have been initialized.''' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codacy found an issue: String statement has no effect
super().__init__([inputs], [outputs], check) | ||
|
||
|
||
class BufferedValidator(LinearValidator): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codacy found an issue: Method '_check' is abstract in class 'ValidatorFilter' but is not overridden
GenericFilter( | ||
inputs="db_atoms", | ||
outputs="lower_atoms", | ||
operation=lambda atom: kh.lower_all_keys_deep(atom) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codacy found an issue: Lambda may not be necessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Listen to the wise codacy.
self.get_coordinates: Callable = get_coordinates | ||
|
||
self._cell_count: int = cell_count or CartesianHashTable._DEFAULT_CELL_COUNT | ||
'''Size in cells, the same for every dimension by default.''' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codacy found an issue: String statement has no effect
'''Cell size on axes. Always the ceiling of the max supported value / _size''' | ||
|
||
self._max_value: Sequence[Real] = list(max_values) if max_values else None | ||
'''The max values found for each table dimension.''' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codacy found an issue: String statement has no effect
'''The min values found for each table dimension.''' | ||
|
||
self._zero: Sequence[int] = None | ||
'''The index at which zero is situated on the axes.''' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codacy found an issue: String statement has no effect
) | ||
# Hold back Stream 0. | ||
validator._hold(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codacy found an issue: Unused variable 'i'
'''List containing the indexes of the buckets that have been initialized.''' | ||
|
||
self._resize_count: int = 0 | ||
'''Counter for the resize operations. Used in testing.''' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codacy found an issue: String statement has no effect
|
||
self._table = CartesianHashTable(get_coordinates) | ||
|
||
def _check(self, data: List[Mapping], indexes: List[int]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codacy found an issue: An attribute defined in otri.validation line 487 hides this method
# Convert dataset | ||
expected = [list() for _ in self.dataset[0]] | ||
for x in self.dataset: | ||
for i in range(len(x)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codacy found an issue: Consider using enumerate instead of iterating with range and len
'''Size in cells, the same for every dimension by default.''' | ||
|
||
self._min_axis_span: Real = CartesianHashTable._MIN_AXIS_SPAN | ||
'''The minimum value an axis should cover if the only value ever found was 0.''' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codacy found an issue: String statement has no effect
append_label(data, label) | ||
|
||
|
||
class LinearValidator(ValidatorFilter): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codacy found an issue: Method '_check' is abstract in class 'ValidatorFilter' but is not overridden
super().__init__(inputs, outputs) | ||
self._limits = limits | ||
|
||
def _check(self, data: List[Mapping], indexes: List[int]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codacy found an issue: An attribute defined in otri.validation line 383 hides this method
'''The minimum value an axis should cover if the only value ever found was 0.''' | ||
|
||
self._count: int = 0 | ||
'''The number of items in the table.''' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codacy found an issue: String statement has no effect
) | ||
# Hold back Stream 0. | ||
validator._hold(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codacy found an issue: Unused variable 'i'
'''The number of items in the table.''' | ||
|
||
self._dimensions: int = None | ||
'''The number of dimensions for this table.''' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codacy found an issue: String statement has no effect
approx = abs(approx) | ||
item_coords = coords(item) | ||
other_coords = coords(other) | ||
for i in range(len(item_coords)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codacy found an issue: Consider using enumerate instead of iterating with range and len
def manage_cluster_result(results): | ||
flagged = dict() | ||
clusters = dict() | ||
for ticker, result in results.items(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codacy found an issue: Unused variable 'ticker'
If you don't want to override the class, you can pass a Callable here. | ||
The Callable should require the atom batch as a parameter. | ||
''' | ||
ParallelFilter.__init__(self, inputs, outputs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codacy found an issue: init method from a non direct base class 'ParallelFilter' is called
self._push_data(data, index) | ||
|
||
|
||
class MonoValidator(LinearValidator): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codacy found an issue: Method '_check' is abstract in class 'ValidatorFilter' but is not overridden
append_label(atom, result) | ||
self._push_data(atom, index) | ||
|
||
def _check(self, data: List[Mapping], indexes: List[int]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codacy found an issue: An attribute defined in otri.validation line 383 hides this method
results = [list(output) for output in f._get_outputs()] | ||
prepared_outputs = [find(output) for output in results] | ||
|
||
for i in range(len(prepared_outputs)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codacy found an issue: Consider using enumerate instead of iterating with range and len
expected[i].append(x[i]) | ||
|
||
self.assertEqual(len(scatter), len(expected)) | ||
for i in range(len(scatter)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codacy found an issue: Consider using enumerate instead of iterating with range and len
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first part of the review. The review is indeed very consuming, both mentally and time-wise
@@ -0,0 +1,301 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add
"index": [
"S&P 100"
]
to each line using the find and replace tool.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Find
"ticker"
Replace
"index": [
"S&P 100"
],
"ticker"
from ..filtering.stream import Stream | ||
|
||
|
||
def db_share_query(session: Session, atoms_table: str, ticker: str, provider: str) -> Query: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't think such method should be in the analysis module, it probably belongs to a future common database interface.
Decide yourself whether you want to keep it for later moving or remove it now.
Parameters: | ||
keys : Set[str] | ||
The keys that may contain clusters. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing cluster size parameter
|
||
|
||
class ClusterAnalysis(Analysis): | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Provide a short description of the analysis: what it does and what it outputs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe also an example of the output.
GenericFilter( | ||
inputs="db_atoms", | ||
outputs="lower_atoms", | ||
operation=lambda atom: kh.lower_all_keys_deep(atom) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Listen to the wise codacy.
key=key, | ||
limit=self.cluster_size | ||
) for key, each_stream, each_output | ||
in zip(self.keys, stream_per_key, output_per_key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Impressive lines of code!
|
||
state = analysis_net.state_dict | ||
|
||
return state, 0, total, elapsed_time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is "flagged": 0?
import math | ||
|
||
T = TypeVar('T') | ||
'''Generic type for the CartesianHashTable's contents.''' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not using the single line comment with #
?
from typing import Set | ||
|
||
|
||
def get_otri_modules() -> Set[str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the use of this?
Lol it sounds pretty passive-aggressive but I mean it, I'd like to know why you added it.
import os | ||
''' | ||
Runs tests, profiles them and prints the result to a "test.prof" file. | ||
It then opens such file with `snakeviz`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, you commented out the snakeviz thing.
Premise
Pull request containing my work from around 10th of August to 15th of September.
This includes many things. Mainly ValidatorFilter and its subclasses. I do not consider the code to be of particularly high quality.
It will still be useful because it highlighted many weaknesses that we should fix before moving on.
Warning
I am not expecting anyone to read 4700 (not even really good) lines.
Due to the extremely high number of lines, I will keep the PR open as a reference, and split it into smaller PRs.
Features
+ Various utilities, such as
CartesianHashTable
,ParallelFilter
,SieveFilter
.+
ValidatorFilter
, dedicated to finding "errors" in Streams.+ Many subclasses for various kinds of error-finding tasks.
+ Some very rough
Analysis
classes to find a few errors on our data.Problems
The code shows us many issues we should address:
Filter
is not flexible enough. Many things cannot be implemented as easily as they should (examples:ParallelFilter
should have not needed to be a subclass. Round-robin input checking has been attempted forLinearValidator
and came out too ugly to keep).Filter
, single class inheritance has been used. Due to this many validator filters implement features that they should not, such as buffering.find_cluster
andsingle_stream_validation
. Hard to communicate the analysis's result. The concepts of analysis, validation and results should be defined better in the future.