Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Project Headache #194

Merged
merged 91 commits into from
Mar 15, 2018
Merged

Project Headache #194

merged 91 commits into from
Mar 15, 2018

Conversation

shansfolder
Copy link
Contributor

Please don't merge yet. Let's merge this branch until all related refactoring is done and tested.

@coveralls
Copy link

coveralls commented Feb 21, 2018

Coverage Status

Coverage decreased (-1.1%) to 91.058% when pulling fd60447 on headache into a2cd16c on master.


if 'entity' not in self.data.columns():
raise RuntimeError("There is no 'entity' column in the data.")
if self.data.entity.duplicated().any():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that it used to be a concern of Expan to filter out those duplicates. Am I wrong?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes it is used to filter out duplicates. Should we still do the same here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that it's quite often the case that data has duplicates — I think we shouldn't break things that we don't have to break and keep at least a resemblance of backward compatibility in terms of functionality.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh I think I misunderstood.

Here entity should not have duplicates because we do our analysis based on this entity level. If there is duplicates, something is wrong when we group the data. And I think we should raise an error if this happens.


if test.variants.variant_column_name not in self.data.columns():
raise RuntimeError("There is no '{}' column in the data.".format(test.variants.variant_column_name))
if test.variants.treatment_name not in np.unique(self.data[test.variants.variant_column_name]):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use set() instead of np.unique()? Is the latter substantially faster than the former?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure. I think so...I can take a look.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the amount of data we have, I feel the runtime is similar. I can't feel the difference.

But I think np.unique is better for our pd.Series because it works on np array level.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gbordyugov @daryadedik Actually we have to use pd.unique() instead of set() on pandas DataFrame.

I just run the method on test data with 8M rows.
set() hangs forever without any error message (it's also the bug Ievgen discovered in Expan Service).
np.unique() takes 10+ seconds.
pd.unique() takes 0.3 seconds.

if test.variants.control_name not in np.unique(self.data[test.variants.variant_column_name]):
raise RuntimeError("There is no control with the name '{}' in the data.".format(test.variants.control_name))

if not isinstance(test.features, list):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be checked in the constructor of 'StatisticalTest` and not here, I believe.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point. I will fix it.


if not isinstance(test.features, list):
raise TypeError("Features should be a list.")
if not all(isinstance(n, FeatureFilter) for n in test.features):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check should also go into the constructor of the test, if I'm not mistaken.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point. I will fix it.

numerator = "normal_same"
denominator = "normal_shifted"
derived_kpi_name = "derived_kpi_one"
DerivedKPI(derived_kpi_name, numerator, denominator).make_derived_kpi(self.data)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make_derived_kpi() should be ensured to perform its deeds only once, either here or within the DerivedKPI.make_derived_kpi() method

Copy link
Contributor Author

@shansfolder shansfolder Feb 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point. Fixed.

shansfolder and others added 28 commits March 6, 2018 16:23
Multiple correction method module
# Conflicts:
#	expan/core/experiment.py
# Conflicts:
#	CHANGELOG.md
#	CHANGELOG.rst
# Conflicts:
#	expan/core/experiment.py
@shansfolder
Copy link
Contributor Author

@gbordyugov you can still review it in the history of merge pull request, but I am afraid its too big to review anyways. Maybe checkout the code directly is easier.

I am gonna merge it now, without releasing. So we can still discuss when you're back.

@shansfolder shansfolder merged commit 648a490 into master Mar 15, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants