-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Project Headache #194
Project Headache #194
Conversation
Results data structure
Make result classes JSON serializable
Adapt experiment module
expan/core/experiment.py
Outdated
|
||
if 'entity' not in self.data.columns(): | ||
raise RuntimeError("There is no 'entity' column in the data.") | ||
if self.data.entity.duplicated().any(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that it used to be a concern of Expan to filter out those duplicates. Am I wrong?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes it is used to filter out duplicates. Should we still do the same here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that it's quite often the case that data has duplicates — I think we shouldn't break things that we don't have to break and keep at least a resemblance of backward compatibility in terms of functionality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh I think I misunderstood.
Here entity
should not have duplicates because we do our analysis based on this entity level. If there is duplicates, something is wrong when we group the data. And I think we should raise an error if this happens.
expan/core/experiment.py
Outdated
|
||
if test.variants.variant_column_name not in self.data.columns(): | ||
raise RuntimeError("There is no '{}' column in the data.".format(test.variants.variant_column_name)) | ||
if test.variants.treatment_name not in np.unique(self.data[test.variants.variant_column_name]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use set()
instead of np.unique()
? Is the latter substantially faster than the former?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure. I think so...I can take a look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the amount of data we have, I feel the runtime is similar. I can't feel the difference.
But I think np.unique
is better for our pd.Series
because it works on np array level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gbordyugov @daryadedik Actually we have to use pd.unique()
instead of set()
on pandas DataFrame.
I just run the method on test data with 8M rows.
set()
hangs forever without any error message (it's also the bug Ievgen discovered in Expan Service).
np.unique()
takes 10+ seconds.
pd.unique()
takes 0.3 seconds.
expan/core/experiment.py
Outdated
if test.variants.control_name not in np.unique(self.data[test.variants.variant_column_name]): | ||
raise RuntimeError("There is no control with the name '{}' in the data.".format(test.variants.control_name)) | ||
|
||
if not isinstance(test.features, list): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be checked in the constructor of 'StatisticalTest` and not here, I believe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point. I will fix it.
expan/core/experiment.py
Outdated
|
||
if not isinstance(test.features, list): | ||
raise TypeError("Features should be a list.") | ||
if not all(isinstance(n, FeatureFilter) for n in test.features): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check should also go into the constructor of the test, if I'm not mistaken.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point. I will fix it.
numerator = "normal_same" | ||
denominator = "normal_shifted" | ||
derived_kpi_name = "derived_kpi_one" | ||
DerivedKPI(derived_kpi_name, numerator, denominator).make_derived_kpi(self.data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make_derived_kpi()
should be ensured to perform its deeds only once, either here or within the DerivedKPI.make_derived_kpi()
method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point. Fixed.
Multiple correction method module
# Conflicts: # expan/core/experiment.py
# Conflicts: # CHANGELOG.md # CHANGELOG.rst
# Conflicts: # expan/core/experiment.py
Finish Documentation
@gbordyugov you can still review it in the history of merge pull request, but I am afraid its too big to review anyways. Maybe checkout the code directly is easier. I am gonna merge it now, without releasing. So we can still discuss when you're back. |
Please don't merge yet. Let's merge this branch until all related refactoring is done and tested.