
Use of safe_indexing in StackingCVClassifier #513


Merged: 6 commits from ackerleytng:master into rasbt:master, Mar 31, 2019

Conversation

@ackerleytng (Contributor) commented Mar 24, 2019

Description

Updated StackingCVClassifier to use safe_indexing from sklearn.

Also did some refactoring: instead of iteratively concatenating the model predictions from each fold, I saved all the predictions in a list and concatenated them once at the end. I'm not sure if the iterative concatenation was meant to save memory, but it shouldn't really make a difference, since the overhead of storing the predictions in a list is quite small.
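
For context, here is a minimal sketch of the failure mode this fixes (not the PR's code; safe_indexing was public in sklearn.utils at the time and was later renamed _safe_indexing):

    import pandas as pd
    from sklearn.model_selection import KFold
    from sklearn.utils import safe_indexing  # renamed _safe_indexing in sklearn >= 0.24

    # A DataFrame whose index does not run 0..n-1 (e.g., after filtering rows)
    X = pd.DataFrame({'feat': range(6)}, index=[3, 7, 8, 12, 15, 21])
    y = pd.Series([0, 1, 0, 1, 0, 1], index=X.index)

    for train_idx, _ in KFold(n_splits=3).split(X):
        # X[train_idx] / y[train_idx] would mis-index or raise a KeyError here,
        # because CV indices are positional (0-based) but the pandas index is not
        X_train = safe_indexing(X, train_idx)  # positional; works for arrays and frames
        y_train = safe_indexing(y, train_idx)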

Related issues or pull requests

None! I was using this for Kaggle and found that fitting was breaking, so I added this fix!

I also did some refactoring while fixing the code.

Pull Request Checklist

  • Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file (if applicable)
  • Added appropriate unit test functions in the ./mlxtend/*/tests directories (if applicable)
  • Modified documentation in the corresponding Jupyter Notebook under mlxtend/docs/sources/ (if applicable)
  • Ran nosetests ./mlxtend -sv and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., nosetests ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv)
  • Checked for style issues by running flake8 ./mlxtend

@ackerleytng (Contributor Author) commented:

There's one part (commit 38dccd0) that I thought was necessary, but I later found that the issue was with my data. Would you like me to remove that commit?

@coveralls commented Mar 24, 2019

Coverage decreased (-0.04%) to 91.578% when pulling 724a0e6 on ackerleytng:master into 5016a00 on rasbt:master.

@rasbt (Owner) commented Mar 24, 2019

Hi there,

I really like the refactoring overall! I'm just a bit concerned about how it fits with other refactoring plans I really liked, i.e.,

#445 via @kota7

StackingEstimator (new, "abstract")
   |-- StackingRegressor
   |-- StackingClassifier
   |-- StackingCVEstimator (new, "abstract")
       |-- StackingCVRegressor
       |-- StackingCVClassifier

Not sure if this is being continued though.

Regarding replacing the arrays with Python lists: I'd actually prefer arrays over Python lists, but I think neither solution is ideal. We should know the exact size of the output array, so it would be more elegant/efficient to initialize an array of that size upfront and then just update it with the respective values.
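
A minimal sketch of that preallocation idea (hypothetical names and sizes, not mlxtend's code):

    import numpy as np

    n_samples, n_models = 6, 2                       # assumed sizes
    meta_features = np.empty((n_samples, n_models))  # allocate once, upfront

    # (test-row indices, predictions) per CV fold, as the CV loop would yield them
    fold_predictions = [
        (np.array([4, 5]), np.zeros((2, n_models))),
        (np.array([0, 1]), np.ones((2, n_models))),
        (np.array([2, 3]), np.full((2, n_models), 0.5)),
    ]

    for test_idx, preds in fold_predictions:
        # assign by row index: each fold's rows land in their original positions,
        # so no concatenation or reordering is needed afterwards
        meta_features[test_idx] = preds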

Regarding safe_indexing: that looks useful! I am just wondering, what would be a use-case scenario where issues with non-consecutive indices occur? Are there particular scikit-learn CV methods you know of that return arrays with non-consecutive indices? In either case, I think it's something we should do, because it doesn't hurt to add it :)

@ackerleytng (Contributor Author) commented:

safe_indexing is actually required before the data is passed to the first-level models: when the data comes in, there's no guarantee that it will be indexable starting from 0 (I added a test case that shows this).

The CV method then returns a bunch of randomized (and generally non-consecutive) indices, which we use to index the incoming data. Those indices are positional, between 0 and data.shape[0], so indexing the data directly, as the code did before, doesn't work.
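
For example (a sketch; the exact indices depend on the random seed):

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    X = np.zeros((10, 1))
    y = np.array([0, 1] * 5)
    skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
    train_idx, test_idx = next(skf.split(X, y))
    print(train_idx)  # e.g. [0 3 4 5 9]: 0-based positions, not consecutive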

Oh yes, you're right about initializing it upfront! I didn't think of that. In fact, that's awesome because we can assign by row index and won't have to do the reshuffling later.

Should I wait for the other refactoring effort, or would you like me to rework the current list-based implementation? Actually, I think some of the methods I refactored could be lifted into parent classes; the other effort is a great idea!

@rasbt (Owner) commented Mar 25, 2019

> Should I wait for the other refactoring effort, or would you like me to rework the current list-based implementation? Actually, I think some of the methods I refactored could be lifted into parent classes; the other effort is a great idea!

Good point. I'm actually not sure whether that other PR is still "active", and I'd say we probably shouldn't wait :). And like you said, these two things are not orthogonal; the refactoring into parent classes would probably also benefit from your refactoring.

> Oh yes, you're right about initializing it upfront! I didn't think of that. In fact, that's awesome because we can assign by row index and won't have to do the reshuffling later.

I think this would be the way to go then, instead of the list of lists you mentioned above, if that's not too much work/hassle.

Commit: "Also remove need to reorder labels when building meta features"

@ackerleytng (Contributor Author) commented:

@rasbt let me know if this is ok!

I didn't remove the hstack from predict_meta_features because it was only a single for loop building the list.

I'd have to wait till the first iteration before being able to get the number of classes to build the final array for the meta features, so the advantage of declaring the final array upfront is reduced.

@kota7 (Contributor) commented Mar 31, 2019

Hi, the refactoring in #445 currently has failing unit tests, and I have been away for a while.
I agree that you shouldn't wait. When I get a chance to fix the problems later, I will do so on top of whatever changes have been made to master by then.

@rasbt (Owner) commented Mar 31, 2019

> I didn't remove the hstack from predict_meta_features because it was only a single for loop building the list.

> I'd have to wait till the first iteration before being able to get the number of classes to build the final array for the meta features, so the advantage of declaring the final array upfront is reduced.

I think that's fine :). The function

    # inside StackingCVClassifier.predict_meta_features
    # (assumes: import numpy as np;
    #  from sklearn.utils.validation import check_is_fitted)
    check_is_fitted(self, ['clfs_', 'meta_clf_'])
    per_model_preds = []

    for model in self.clfs_:
        if not self.use_probas:
            # one column of class labels per model
            prediction = model.predict(X)[:, np.newaxis]
        else:
            # one column per class, per model
            prediction = model.predict_proba(X)

        per_model_preds.append(prediction)

    # stack all per-model columns side by side
    return np.hstack(per_model_preds)

is now so nice and simple thanks to the refactoring that it is very easy to read what's going on, which is a benefit as well. Later, when/if we refactor all stacking classes based on a BaseStacker, we can think about parallelizing this step. But for now, I think it is already a big improvement over the previous approach.

Unless you have something else you'd like to add, I'd be happy to merge.

@ackerleytng (Contributor Author) commented:

Please merge! Thanks so much! Thanks for maintaining mlxtend too!

@rasbt merged commit a3a539e into rasbt:master on Mar 31, 2019.