Draft of Validator Filters #46

ricdezen · 2020-10-02T21:08:22Z

Premise

This pull request contains my work from around the 10th of August to the 15th of September.
It includes many things, mainly ValidatorFilter and its subclasses. I do not consider the code to be of particularly high quality.
It will still be useful because it highlighted many weaknesses that we should fix before moving on.

Warning

I do not expect anyone to read 4700 lines (and not even particularly good ones).
Due to the extremely high number of lines, I will keep the PR open as a reference, and split it into smaller PRs.

Features

+ Various utilities, such as CartesianHashTable, ParallelFilter, SieveFilter.
+ ValidatorFilter, dedicated to finding "errors" in Streams.
+ Many subclasses for various kinds of error-finding tasks.
+ Some very rough Analysis classes to find a few errors on our data.

Problems

The code shows us many issues we should address:

+ Filter is not flexible enough. Many things cannot be implemented as easily as they should be. Examples: ParallelFilter should not have needed to be a subclass; round-robin input checking was attempted for LinearValidator and came out too ugly to keep.
+ Related to the previous point: many of the things our Filters do should be separated. We should have a graph of classes, not a tree. To avoid tinkering too much with Filter, single class inheritance was used. Because of this, many validator filters implement features that they should not, such as buffering.
+ Analysis tries to be a generic analysis, which is useless as it stands. See find_cluster and single_stream_validation. It is hard to communicate an analysis's result. The concepts of analysis, validation and results should be defined better in the future.
+ CartesianHashTable is much less useful than expected. To check whether items in a Stream have neighbors in K other Streams, K hash tables are needed, because each table must not contain items from the Stream being checked.
+ We should better define what a label for error recognition is. As of now, it is a string of the form "ERRORNAME(info on the error)". I did this to allow JSON serialization. The only way to find whether a label is a certain type of error is a regex. Do we need to be able to find which labels represent a certain type of error? Do we want labels to be JSON serializable?
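To make the label question concrete, here is a minimal sketch of the current string form next to a structured alternative. Only the "ERRORNAME(info on the error)" shape comes from the description above; the helper names (make_label, parse_label, is_error, make_label_dict), the regex grammar and the "NullError" example are assumptions made for illustration and are not taken from the PR.

```python
import json
import re
from typing import Optional, Tuple

# Matches "ERRORNAME(info on the error)". The exact grammar is an assumption.
LABEL_PATTERN = re.compile(r"^(?P<name>\w+)\((?P<info>.*)\)$", re.DOTALL)


def make_label(name: str, info: str) -> str:
    """Build a label in the string form described above."""
    return f"{name}({info})"


def parse_label(label: str) -> Optional[Tuple[str, str]]:
    """Return (error name, info), or None if the label does not match."""
    match = LABEL_PATTERN.match(label)
    if match is None:
        return None
    return match.group("name"), match.group("info")


def is_error(label: str, name: str) -> bool:
    """Check whether a string label represents the given type of error."""
    parsed = parse_label(label)
    return parsed is not None and parsed[0] == name


def make_label_dict(name: str, info: str) -> dict:
    """Structured alternative: JSON serializable and needs no regex."""
    return {"error": name, "info": info}


label = make_label("NullError", "key 'close' is None")
structured = make_label_dict("NullError", "key 'close' is None")

# Both forms survive JSON serialization.
serialized = json.dumps([label, structured])
```

With the structured form, finding every label of one type becomes a plain comparison on the "error" field, at the cost of slightly more verbose atoms.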

CremaLuca · 2020-10-02T21:11:15Z

Here is an overview of what got changed by this pull request:

Issues
======
- Added 36
           

Complexity increasing per file
==============================
- otri/validation/validators/coverage_validator.py  7
- otri/analysis/find_clusters.py  5
- test/validation/validators/coverage_validator_test.py  2
- test/validation/validators/neighbor_validator_test.py  9
- otri/validation/valchecks.py  6
- test/validation/__init__.py  3
- test/validation/valchecks_test.py  3
- test/utils/cartesian_hashtable_test.py  6
- misc/dict_profile.py  2
- misc/profile_cartesian.py  2
- otri/analysis/find_null.py  1
- otri/validation/exceptions.py  3
- misc/profile_neighbors.py  3
- otri/validation/validators/cluster_validator.py  3
- test/validation/validation_test.py  4
- test/filtering/filter_test.py  1
- otri/filtering/filter.py  3
- otri/utils/cartesian_hashtable.py  11
- test/validation/validators/discrepancy_validator_test.py  9
- otri/filtering/filters/sieve_filter.py  2
- otri/validation/validators/discrepancy_validator.py  4
- otri/validation/validators/continuity_validator.py  4
- test/validation/validators/continuity_validator_test.py  3
- single_stream_validation.py  5
- otri/validation/validators/neighbor_validator.py  5
- test/validation/validators/cluster_validator_test.py  3
- otri/analysis/find_negatives.py  1
         

Clones added
============
- otri/analysis/find_clusters.py  1
- test/validation/validators/neighbor_validator_test.py  1
- autocorrelation.py  1
- test/utils/cartesian_hashtable_test.py  2
- misc/dict_profile.py  3
- misc/profile_cartesian.py  3
- otri/analysis/find_null.py  4
- misc/profile_neighbors.py  2
- test/validation/validation_test.py  31
- test/filtering/filter_test.py  10
- test/validation/validators/discrepancy_validator_test.py  1
- single_stream_validation.py  1
- test/validation/validators/cluster_validator_test.py  1
- otri/analysis/find_negatives.py  4

See the complete overview on Codacy

CremaLuca · 2020-10-02T21:11:15Z

otri/utils/cartesian_hashtable.py

+        '''The index at which zero is situated on the axes.'''
+
+        self._table: Iterable[List[T]] = None
+        '''Table containing the buckets.'''


Codacy found an issue: String statement has no effect

CremaLuca · 2020-10-02T21:11:16Z

test/utils/cartesian_hashtable_test.py

+        '''
+        self.table = CartesianHashTable(cartesian_tuple)
+        self.table.add((10, 10, 10))
+        pass


Codacy found an issue: Unnecessary pass statement

CremaLuca · 2020-10-02T21:11:17Z

otri/utils/cartesian_hashtable.py

+        '''Table containing the buckets.'''
+
+        self._initialized_buckets: List[Tuple[int]] = list()
+        '''List containing the indexes of the buckets that have been initialized.'''


Codacy found an issue: String statement has no effect

CremaLuca · 2020-10-02T21:11:19Z

otri/validation/__init__.py

+        super().__init__([inputs], [outputs], check)
+
+
+class BufferedValidator(LinearValidator):


Codacy found an issue: Method '_check' is abstract in class 'ValidatorFilter' but is not overridden

CremaLuca · 2020-10-02T21:11:20Z

otri/analysis/find_clusters.py

+                GenericFilter(
+                    inputs="db_atoms",
+                    outputs="lower_atoms",
+                    operation=lambda atom: kh.lower_all_keys_deep(atom)


Codacy found an issue: Lambda may not be necessary

Listen to the wise codacy.

CremaLuca · 2020-10-02T21:11:21Z

otri/utils/cartesian_hashtable.py

+        self.get_coordinates: Callable = get_coordinates
+
+        self._cell_count: int = cell_count or CartesianHashTable._DEFAULT_CELL_COUNT
+        '''Size in cells, the same for every dimension by default.'''


Codacy found an issue: String statement has no effect

CremaLuca · 2020-10-02T21:11:22Z

otri/utils/cartesian_hashtable.py

+        '''Cell size on axes. Always the ceiling of the max supported value / _size'''
+
+        self._max_value: Sequence[Real] = list(max_values) if max_values else None
+        '''The max values found for each table dimension.'''


Codacy found an issue: String statement has no effect

CremaLuca · 2020-10-02T21:11:23Z

otri/utils/cartesian_hashtable.py

+        '''The min values found for each table dimension.'''
+
+        self._zero: Sequence[int] = None
+        '''The index at which zero is situated on the axes.'''


Codacy found an issue: String statement has no effect

CremaLuca · 2020-10-02T21:11:24Z

test/validation/validation_test.py

+        )
+        # Hold back Stream 0.
+        validator._hold(0)


Codacy found an issue: Unused variable 'i'

CremaLuca · 2020-10-02T21:11:25Z

otri/utils/cartesian_hashtable.py

+        '''List containing the indexes of the buckets that have been initialized.'''
+
+        self._resize_count: int = 0
+        '''Counter for the resize operations. Used in testing.'''


Codacy found an issue: String statement has no effect

CremaLuca · 2020-10-02T21:11:26Z

otri/validation/validators/neighbor_validator.py

+
+        self._table = CartesianHashTable(get_coordinates)
+
+    def _check(self, data: List[Mapping], indexes: List[int]):


Codacy found an issue: An attribute defined in otri.validation line 487 hides this method

CremaLuca · 2020-10-02T21:11:27Z

test/utils/cartesian_hashtable_test.py

+        # Convert dataset
+        expected = [list() for _ in self.dataset[0]]
+        for x in self.dataset:
+            for i in range(len(x)):


Codacy found an issue: Consider using enumerate instead of iterating with range and len
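A possible shape for the fix, sketched on the dataset conversion quoted above. The function name split_by_axis and the sample data are hypothetical; the loop body mirrors the flagged test code.

```python
from typing import List, Sequence


def split_by_axis(dataset: Sequence[Sequence[float]]) -> List[List[float]]:
    """Group the i-th coordinate of every point into the i-th list."""
    expected: List[List[float]] = [list() for _ in dataset[0]]
    for point in dataset:
        # enumerate yields the index and the value together,
        # replacing range(len(point)) followed by point[i].
        for i, value in enumerate(point):
            expected[i].append(value)
    return expected


# Hypothetical sample data: two points with three coordinates each.
dataset = [(1, 2, 3), (4, 5, 6)]
per_axis = split_by_axis(dataset)
```

When every point has the same number of coordinates, `[list(axis) for axis in zip(*dataset)]` gives the same result without any explicit index.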

CremaLuca · 2020-10-02T21:11:28Z

otri/utils/cartesian_hashtable.py

+        '''Size in cells, the same for every dimension by default.'''
+
+        self._min_axis_span: Real = CartesianHashTable._MIN_AXIS_SPAN
+        '''The minimum value an axis should cover if the only value ever found was 0.'''


Codacy found an issue: String statement has no effect

CremaLuca · 2020-10-02T21:11:29Z

otri/validation/__init__.py

+        append_label(data, label)
+
+
+class LinearValidator(ValidatorFilter):


Codacy found an issue: Method '_check' is abstract in class 'ValidatorFilter' but is not overridden

CremaLuca · 2020-10-02T21:11:30Z

otri/validation/validators/discrepancy_validator.py

+        super().__init__(inputs, outputs)
+        self._limits = limits
+
+    def _check(self, data: List[Mapping], indexes: List[int]):


Codacy found an issue: An attribute defined in otri.validation line 383 hides this method

CremaLuca · 2020-10-02T21:11:31Z

otri/utils/cartesian_hashtable.py

+        '''The minimum value an axis should cover if the only value ever found was 0.'''
+
+        self._count: int = 0
+        '''The number of items in the table.'''


Codacy found an issue: String statement has no effect

CremaLuca · 2020-10-02T21:11:32Z

test/validation/validation_test.py

+        )
+        # Hold back Stream 0.
+        validator._hold(0)


Codacy found an issue: Unused variable 'i'

CremaLuca · 2020-10-02T21:11:33Z

otri/utils/cartesian_hashtable.py

+        '''The number of items in the table.'''
+
+        self._dimensions: int = None
+        '''The number of dimensions for this table.'''


Codacy found an issue: String statement has no effect

CremaLuca · 2020-10-02T21:11:34Z

test/utils/cartesian_hashtable_test.py

+    approx = abs(approx)
+    item_coords = coords(item)
+    other_coords = coords(other)
+    for i in range(len(item_coords)):


Codacy found an issue: Consider using enumerate instead of iterating with range and len

CremaLuca · 2020-10-02T21:11:35Z

single_stream_validation.py

+def manage_cluster_result(results):
+    flagged = dict()
+    clusters = dict()
+    for ticker, result in results.items():


Codacy found an issue: Unused variable 'ticker'
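If the key really is not needed inside the loop, iterating over the values alone removes the warning. The sketch below is an assumption about the intent: the body of manage_cluster_result is not shown here, so count_flagged and its data are hypothetical stand-ins.

```python
from typing import Dict, List


def count_flagged(results: Dict[str, List[str]]) -> int:
    """Sum the number of flagged items across all tickers."""
    total = 0
    # Only the values are used, so there is no unused loop variable.
    for result in results.values():
        total += len(result)
    return total


# Hypothetical results keyed by ticker.
sample = {"AAPL": ["a", "b"], "MSFT": ["c"]}
flagged_count = count_flagged(sample)
```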

CremaLuca · 2020-10-02T21:11:36Z

otri/validation/__init__.py

+                If you don't want to override the class, you can pass a Callable here.
+                The Callable should require the atom batch as a parameter.
+        '''
+        ParallelFilter.__init__(self, inputs, outputs)


Codacy found an issue: init method from a non direct base class 'ParallelFilter' is called

CremaLuca · 2020-10-02T21:11:37Z

otri/validation/__init__.py

+        self._push_data(data, index)
+
+
+class MonoValidator(LinearValidator):


Codacy found an issue: Method '_check' is abstract in class 'ValidatorFilter' but is not overridden

CremaLuca · 2020-10-02T21:11:38Z

otri/validation/__init__.py

+            append_label(atom, result)
+            self._push_data(atom, index)
+
+    def _check(self, data: List[Mapping], indexes: List[int]):


Codacy found an issue: An attribute defined in otri.validation line 383 hides this method

CremaLuca · 2020-10-02T21:11:39Z

test/validation/validators/neighbor_validator_test.py

+        results = [list(output) for output in f._get_outputs()]
+        prepared_outputs = [find(output) for output in results]
+
+        for i in range(len(prepared_outputs)):


Codacy found an issue: Consider using enumerate instead of iterating with range and len

CremaLuca · 2020-10-02T21:11:40Z

test/utils/cartesian_hashtable_test.py

+                expected[i].append(x[i])
+
+        self.assertEqual(len(scatter), len(expected))
+        for i in range(len(scatter)):


Codacy found an issue: Consider using enumerate instead of iterating with range and len

CremaLuca

This is the first part of the review. Reviewing this is indeed very demanding, both mentally and time-wise.

CremaLuca · 2020-10-04T14:06:30Z

docs/snp100.json

@@ -0,0 +1,301 @@
+{


Please add

"index": [ "S&P 100" ]

to each line using the find and replace tool.

Find

"ticker"

Replace

"index": [ "S&P 100" ], "ticker"

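The same find and replace can be scripted. This is a minimal sketch on an in-memory string: the sample entries are hypothetical, since the real layout of docs/snp100.json is not shown in this thread.

```python
import json

# Hypothetical sample; the real file content is not reproduced here.
text = '[{"ticker": "AAPL", "name": "Apple"}, {"ticker": "MSFT", "name": "Microsoft"}]'

FIND = '"ticker"'
REPLACE = '"index": [ "S&P 100" ], "ticker"'

# Plain text substitution, exactly as the editor's find and replace would do.
updated = text.replace(FIND, REPLACE)

# Parsing the result confirms the file is still valid JSON.
entries = json.loads(updated)
```

Note that the substitution is not idempotent: running it twice would insert the "index" key twice in every entry.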
CremaLuca · 2020-10-04T14:10:11Z

otri/analysis/__init__.py

 from ..filtering.stream import Stream


+def db_share_query(session: Session, atoms_table: str, ticker: str, provider: str) -> Query:


I don't think this method belongs in the analysis module; it probably belongs in a future common database interface.

Decide for yourself whether to keep it here for now and move it later, or to remove it now.

CremaLuca · 2020-10-04T14:10:53Z

otri/analysis/find_clusters.py

+        Parameters:
+            keys : Set[str]
+                The keys that may contain clusters.
+


Missing cluster size parameter

CremaLuca · 2020-10-04T14:12:42Z

otri/analysis/find_clusters.py

+
+
+class ClusterAnalysis(Analysis):
+


Provide a short description of the analysis: what it does and what it outputs.

Maybe also an example of the output.

CremaLuca · 2020-10-04T14:14:15Z

otri/analysis/find_clusters.py

+                GenericFilter(
+                    inputs="db_atoms",
+                    outputs="lower_atoms",
+                    operation=lambda atom: kh.lower_all_keys_deep(atom)


Listen to the wise codacy.
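What Codacy is pointing at: a lambda that only forwards its argument to another function can be replaced by the function itself. The sketch below uses hypothetical stand-ins (lower_all_keys for kh.lower_all_keys_deep, run_operation for the filter that calls the operation), because neither is shown in this thread.

```python
from typing import Any, Callable, Mapping


def lower_all_keys(atom: Mapping[str, Any]) -> dict:
    """Hypothetical stand-in for kh.lower_all_keys_deep (top level only)."""
    return {key.lower(): value for key, value in atom.items()}


def run_operation(operation: Callable[[Mapping[str, Any]], dict],
                  atom: Mapping[str, Any]) -> dict:
    """Hypothetical stand-in for a filter applying its operation to an atom."""
    return operation(atom)


atom = {"Ticker": "AAPL", "Close": 1.0}

# Wrapping the function in a lambda adds a call layer and nothing else.
with_lambda = run_operation(lambda a: lower_all_keys(a), atom)

# Passing the function object directly is equivalent.
direct = run_operation(lower_all_keys, atom)
```

Applied to the flagged line, this would presumably read `operation=kh.lower_all_keys_deep`.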

CremaLuca · 2020-10-04T14:19:01Z

otri/analysis/find_clusters.py

+                    key=key,
+                    limit=self.cluster_size
+                ) for key, each_stream, each_output
+                in zip(self.keys, stream_per_key, output_per_key)


Impressive lines of code!

CremaLuca · 2020-10-04T14:42:43Z

otri/analysis/find_clusters.py

+
+        state = analysis_net.state_dict
+
+        return state, 0, total, elapsed_time


Why is "flagged": 0?

CremaLuca · 2020-10-04T14:46:09Z

otri/utils/cartesian_hashtable.py

+import math
+
+T = TypeVar('T')
+'''Generic type for the CartesianHashTable's contents.'''


Why not use a single-line comment with #?
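For context on the two styles: PEP 257 calls a string literal placed right after an assignment an "attribute docstring" and notes that such strings are not accessible as runtime attributes, although some documentation tools read them. That is why Codacy reports "String statement has no effect" on them. A minimal sketch of the difference, using a hypothetical Table class rather than the real CartesianHashTable:

```python
from typing import TypeVar

T = TypeVar('T')
'''Generic type for the table's contents (string statement form).'''

# Generic type for the table's contents (comment form).
U = TypeVar('U')


class Table:
    '''A real docstring: stored on the class as __doc__.'''

    def __init__(self) -> None:
        self._count: int = 0
        '''The number of items in the table.'''  # evaluated, then discarded


table = Table()
```

Only the class docstring is kept at runtime; the string after `self._count` does not become a docstring of `__init__` or of the attribute.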

CremaLuca · 2020-10-06T08:46:29Z

otri/utils/__init__.py

+from typing import Set
+
+
+def get_otri_modules() -> Set[str]:


What's the use of this?
Lol it sounds pretty passive-aggressive but I mean it, I'd like to know why you added it.

CremaLuca · 2020-10-06T08:48:19Z

run_all_tests.py

-import os
+'''
+Runs tests, profiles them and prints the result to a "test.prof" file.
+It then opens such file with `snakeviz`.


Nope, you commented out the snakeviz thing.

Riccardo De Zen added 30 commits June 1, 2020 22:41

feat(filter_layer.py): draft of layer execution policies

76c87b4

Merge branch 'refactor/filtering-pattern' of https://github.com/OTRI-…

126279c

…Unipd/OTRI into refactor/filtering-pattern

Merge branch 'refactor/filtering-pattern' of https://github.com/OTRI-…

5e8ca6c

…Unipd/OTRI into refactor/filtering-pattern

feat (validation.py): refactoring draft

1252981

Merge branch 'refactor/filter-functions' of https://github.com/OTRI-U…

cef69ae

…nipd/OTRI into feat/validation-refactor

fix(filter.py):

9335d35

+ Minor fixes to specs and comments in Filter

feat(validation.py):

79a5039

- removed old ValidatorFilter + Rewritten as abstract class

draft(validation.py):

ca92511

+ draft for subclasses of ValidatorFilter

refactor(validation.py):

c29498c

+ moved make_check_date_between to a separate module

feat(validation.py):

a9697c3

+ Implementation of MonoValidator + Minor spec fixes

Merge branch 'develop' of https://github.com/OTRI-Unipd/OTRI into fea…

b28bd86

…t/validation-refactor

feat (validation.py):

e3bbfe7

+ Reworked Validators as filters that append errors to atoms. feat(exceptions.py): + Added base classes for errors or warnings in atoms' content. feat(valchecks.py): ~ Reduced method verbosity. feat(validation_test.py): + Remade tests for new MonoValidator.

feat (exceptions.py):

92ed3b2

+ Added some exceptions feat (validation.py): + Minor refactor

feat (exceptions.py): Added various Errors and a Warning

b4ff516

feat (valchecks.py): Added some checks, those that can be performed by a single function feat (validation.py): Added ContinuityValidator to operate on a Stream's continuity

feat (exceptions.py): Added various Errors and a Warning

0eda0da

feat (valchecks.py): Added some checks, those that can be performed by a single function feat (validation.py): Added ContinuityValidator to operate on a Stream's continuity

feat (filter.py):

6d43079

Added ParallelFilter, waits for all open inputs to have an atom and pops them all at once. feat (filter_test.py): Parameterized Tests to allow running them on ParallelFilter too feat (utils): + __init__.py: Added method listing all modules in the main package.

feat (validation.py): added ParallelValidator

ae75fcc

feat: same number of input and output streams

48d947a

feat: exceptions rework, they take an args dict as parameter which ar…

7bcc54f

…e the keys and values that threw the errors

feat: adapted methods for new exceptions

7a13fd8

Fixed tests to use lists

14953f2

feat: split parallel filter with its own test to test parallelism

b45fdf3

feat: parameterized checks tests

c12fd9c

feat(valchecks_test): fully parameterized tests

3904b5a

feat(valchecks): added make_check_set, check that values are in a cer…

ce5c799

…tain Iterable

feat (valchecks_test): added test for method persistency

4575c98

feat (validation_test): basic tests for LinearValidator

9d9f89e

fix(validation_test): removed useless length check

08d7786

fixed error in ParallelFilter that would prevent getting data from cl…

46d281f

…osed non-empty streams

feat(validation):

7faf65a

+ Added constructor to have check patameter feat(validation_test): + Added tests for ParallelValidator

CremaLuca self-requested a review November 26, 2020 21:36
