Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft of Validator Filters #46

Open
wants to merge 102 commits into
base: develop
Choose a base branch
from
Open

Conversation

ricdezen
Copy link
Contributor

@ricdezen ricdezen commented Oct 2, 2020

Premise

Pull request containing my work from around 10th of August to 15th of September.
This includes many things. Mainly ValidatorFilter and its subclasses. I do not consider the code to be of particularly high quality.
It will still be useful because it highlighted many weaknesses that we should fix before moving on.

Warning

I am not expecting anyone to read 4700 (not even really good) lines.
Due to the extremely high number of lines, I will keep the PR open as a reference, and split it into smaller PRs.

Features

+ Various utilities, such as CartesianHashTable, ParallelFilter, SieveFilter.
+ ValidatorFilter, dedicated to finding "errors" in Streams.
+ Many subclasses for various kinds of error-finding tasks.
+ Some very rough Analysis classes to find a few errors on our data.

Problems

The code shows us many issues we should address:

  • Filter is not flexible enough. Many things cannot be implemented as easily as they should (examples: ParallelFilter should have not needed to be a subclass. Round-robin input checking has been attempted for LinearValidator and came out too ugly to keep).
  • Related to the previous, many many things that our Filters do should be separated. We should have a Graph of classes, not a Tree. To avoid tinkering too much with Filter, single class inheritance has been used. Due to this many validator filters implement features that they should not, such as buffering.
  • Analysis tries to be a generic Analysis, which is useless as it is. See find_cluster and single_stream_validation. Hard to communicate the analysis's result. The concepts of analysis, validation and results should be defined better in the future.
  • CartesianHashTable is much less useful than expected. In order to check wether items in a Stream have neighbors in other K Streams, K hashtables are needed, because each table must not contain items from the Stream being checked.
  • We should define better what is a label for error recognition. As of now, it is a string in the form: "ERRORNAME(info on the error)". I did this to allow JSON serialization. Only way to find wether a label is a certain type of error is regex. Do we need to be able to find which labels represent a certain type of error? Do we want to allow labels to be JSON serializable?

Riccardo De Zen added 30 commits June 1, 2020 22:41
+ Minor fixes to specs and comments in Filter
- removed old ValidatorFilter
+ Rewritten as abstract class
+ draft for subclasses of ValidatorFilter
+ moved make_check_date_between to a separate module
+ Implementation of MonoValidator
+ Minor spec fixes
+ Reworked Validators as filters that append errors to atoms.

feat(exceptions.py):
+ Added base classes for errors or warnings in atoms' content.

feat(valchecks.py):
~ Reduced method verbosity.

feat(validation_test.py):
+ Remade tests for new MonoValidator.
+ Added some exceptions

feat (validation.py):
+ Minor refactor
feat (valchecks.py): Added some checks, those that can be performed by a single function
feat (validation.py): Added ContinuityValidator to operate on a Stream's continuity
feat (valchecks.py): Added some checks, those that can be performed by a single function
feat (validation.py): Added ContinuityValidator to operate on a Stream's continuity
Added ParallelFilter, waits for all open inputs to have an atom and pops them all
at once.

feat (filter_test.py):
Parameterized Tests to allow running them on ParallelFilter too

feat (utils):
+ __init__.py: Added method listing all modules in the main package.
…e the keys and values that threw the errors
+ Added constructor to have check patameter

feat(validation_test):
+ Added tests for ParallelValidator
Copy link
Member

Codacy Here is an overview of what got changed by this pull request:

Issues
======
- Added 36
           

Complexity increasing per file
==============================
- otri/validation/validators/coverage_validator.py  7
- otri/analysis/find_clusters.py  5
- test/validation/validators/coverage_validator_test.py  2
- test/validation/validators/neighbor_validator_test.py  9
- otri/validation/valchecks.py  6
- test/validation/__init__.py  3
- test/validation/valchecks_test.py  3
- test/utils/cartesian_hashtable_test.py  6
- misc/dict_profile.py  2
- misc/profile_cartesian.py  2
- otri/analysis/find_null.py  1
- otri/validation/exceptions.py  3
- misc/profile_neighbors.py  3
- otri/validation/validators/cluster_validator.py  3
- test/validation/validation_test.py  4
- test/filtering/filter_test.py  1
- otri/filtering/filter.py  3
- otri/utils/cartesian_hashtable.py  11
- test/validation/validators/discrepancy_validator_test.py  9
- otri/filtering/filters/sieve_filter.py  2
- otri/validation/validators/discrepancy_validator.py  4
- otri/validation/validators/continuity_validator.py  4
- test/validation/validators/continuity_validator_test.py  3
- single_stream_validation.py  5
- otri/validation/validators/neighbor_validator.py  5
- test/validation/validators/cluster_validator_test.py  3
- otri/analysis/find_negatives.py  1
         

Clones added
============
- otri/analysis/find_clusters.py  1
- test/validation/validators/neighbor_validator_test.py  1
- autocorrelation.py  1
- test/utils/cartesian_hashtable_test.py  2
- misc/dict_profile.py  3
- misc/profile_cartesian.py  3
- otri/analysis/find_null.py  4
- misc/profile_neighbors.py  2
- test/validation/validation_test.py  31
- test/filtering/filter_test.py  10
- test/validation/validators/discrepancy_validator_test.py  1
- single_stream_validation.py  1
- test/validation/validators/cluster_validator_test.py  1
- otri/analysis/find_negatives.py  4
         

See the complete overview on Codacy

'''The index at which zero is situated on the axes.'''

self._table: Iterable[List[T]] = None
'''Table containing the buckets.'''
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codacy found an issue: String statement has no effect

'''
self.table = CartesianHashTable(cartesian_tuple)
self.table.add((10, 10, 10))
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codacy found an issue: Unnecessary pass statement

'''Table containing the buckets.'''

self._initialized_buckets: List[Tuple[int]] = list()
'''List containing the indexes of the buckets that have been initialized.'''
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codacy found an issue: String statement has no effect

super().__init__([inputs], [outputs], check)


class BufferedValidator(LinearValidator):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GenericFilter(
inputs="db_atoms",
outputs="lower_atoms",
operation=lambda atom: kh.lower_all_keys_deep(atom)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codacy found an issue: Lambda may not be necessary

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Listen to the wise codacy.

self.get_coordinates: Callable = get_coordinates

self._cell_count: int = cell_count or CartesianHashTable._DEFAULT_CELL_COUNT
'''Size in cells, the same for every dimension by default.'''
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codacy found an issue: String statement has no effect

'''Cell size on axes. Always the ceiling of the max supported value / _size'''

self._max_value: Sequence[Real] = list(max_values) if max_values else None
'''The max values found for each table dimension.'''
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codacy found an issue: String statement has no effect

'''The min values found for each table dimension.'''

self._zero: Sequence[int] = None
'''The index at which zero is situated on the axes.'''
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codacy found an issue: String statement has no effect

)
# Hold back Stream 0.
validator._hold(0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codacy found an issue: Unused variable 'i'

'''List containing the indexes of the buckets that have been initialized.'''

self._resize_count: int = 0
'''Counter for the resize operations. Used in testing.'''
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codacy found an issue: String statement has no effect


self._table = CartesianHashTable(get_coordinates)

def _check(self, data: List[Mapping], indexes: List[int]):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# Convert dataset
expected = [list() for _ in self.dataset[0]]
for x in self.dataset:
for i in range(len(x)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'''Size in cells, the same for every dimension by default.'''

self._min_axis_span: Real = CartesianHashTable._MIN_AXIS_SPAN
'''The minimum value an axis should cover if the only value ever found was 0.'''
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codacy found an issue: String statement has no effect

append_label(data, label)


class LinearValidator(ValidatorFilter):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super().__init__(inputs, outputs)
self._limits = limits

def _check(self, data: List[Mapping], indexes: List[int]):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'''The minimum value an axis should cover if the only value ever found was 0.'''

self._count: int = 0
'''The number of items in the table.'''
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codacy found an issue: String statement has no effect

)
# Hold back Stream 0.
validator._hold(0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codacy found an issue: Unused variable 'i'

'''The number of items in the table.'''

self._dimensions: int = None
'''The number of dimensions for this table.'''
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codacy found an issue: String statement has no effect

approx = abs(approx)
item_coords = coords(item)
other_coords = coords(other)
for i in range(len(item_coords)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def manage_cluster_result(results):
flagged = dict()
clusters = dict()
for ticker, result in results.items():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codacy found an issue: Unused variable 'ticker'

If you don't want to override the class, you can pass a Callable here.
The Callable should require the atom batch as a parameter.
'''
ParallelFilter.__init__(self, inputs, outputs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self._push_data(data, index)


class MonoValidator(LinearValidator):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

append_label(atom, result)
self._push_data(atom, index)

def _check(self, data: List[Mapping], indexes: List[int]):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

results = [list(output) for output in f._get_outputs()]
prepared_outputs = [find(output) for output in results]

for i in range(len(prepared_outputs)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expected[i].append(x[i])

self.assertEqual(len(scatter), len(expected))
for i in range(len(scatter)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@CremaLuca CremaLuca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first part of the review. The review is indeed very consuming, both mentally and time-wise

@@ -0,0 +1,301 @@
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add

"index": [
    "S&P 100"
]

to each line using the find and replace tool.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Find

"ticker"

Replace

"index": [
                "S&P 100"
            ],
            "ticker"

from ..filtering.stream import Stream


def db_share_query(session: Session, atoms_table: str, ticker: str, provider: str) -> Query:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think such method should be in the analysis module, it probably belongs to a future common database interface.

Decide yourself whether you want to keep it for later moving or remove it now.

Parameters:
keys : Set[str]
The keys that may contain clusters.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing cluster size parameter



class ClusterAnalysis(Analysis):

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Provide a short description of the analysis: what it does and what it outputs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also an example of the output.

GenericFilter(
inputs="db_atoms",
outputs="lower_atoms",
operation=lambda atom: kh.lower_all_keys_deep(atom)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Listen to the wise codacy.

key=key,
limit=self.cluster_size
) for key, each_stream, each_output
in zip(self.keys, stream_per_key, output_per_key)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Impressive lines of code!


state = analysis_net.state_dict

return state, 0, total, elapsed_time
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is "flagged": 0?

import math

T = TypeVar('T')
'''Generic type for the CartesianHashTable's contents.'''
Copy link
Member

@CremaLuca CremaLuca Oct 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not using the single line comment with #?

from typing import Set


def get_otri_modules() -> Set[str]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the use of this?
Lol it sounds pretty passive-aggressive but I mean it, I'd like to know why you added it.

import os
'''
Runs tests, profiles them and prints the result to a "test.prof" file.
It then opens such file with `snakeviz`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, you commented out the snakeviz thing.

@CremaLuca CremaLuca self-requested a review November 26, 2020 21:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants