
Use of safe_indexing in StackingCVClassifier #513


Merged: 6 commits from ackerleytng:master into rasbt:master, Mar 31, 2019

Conversation

@ackerleytng (Contributor) commented Mar 24, 2019

Description

Updated StackingCVClassifier to use safe_indexing from sklearn.

Also did some refactoring: instead of iteratively concatenating the model predictions from each fold, I saved all the predictions in a list and concatenated them once at the end. I'm not sure if the iterative concatenation was meant to save memory, but it shouldn't really make a difference, since the overhead of storing the predictions in a list is quite small.
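
For context, here is a minimal sketch of the failure mode this fixes (not the PR's code; safe_indexing was public in sklearn.utils at the time and was later renamed _safe_indexing):

    import pandas as pd
    from sklearn.model_selection import KFold
    from sklearn.utils import safe_indexing  # renamed _safe_indexing in sklearn >= 0.24

    # A DataFrame whose index does not run 0..n-1 (e.g., after filtering rows)
    X = pd.DataFrame({'feat': range(6)}, index=[3, 7, 8, 12, 15, 21])
    y = pd.Series([0, 1, 0, 1, 0, 1], index=X.index)

    for train_idx, _ in KFold(n_splits=3).split(X):
        # X[train_idx] / y[train_idx] would mis-index or raise a KeyError here,
        # because CV indices are positional (0-based) but the pandas index is not
        X_train = safe_indexing(X, train_idx)  # positional; works for arrays and frames
        y_train = safe_indexing(y, train_idx)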

Related issues or pull requests

None! I was using this for Kaggle and found that fitting was breaking, so I added this fix!

I also did some refactoring while fixing the code.

Pull Request Checklist

  • Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file (if applicable)
  • Added appropriate unit test functions in the ./mlxtend/*/tests directories (if applicable)
  • Modified documentation in the corresponding Jupyter Notebook under mlxtend/docs/sources/ (if applicable)
  • Ran nosetests ./mlxtend -sv and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., nosetests ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv)
  • Checked for style issues by running flake8 ./mlxtend

@ackerleytng (Contributor Author) commented:

There's one part (commit 38dccd0) that I thought was necessary, but I later found that the issue was with my data. Would you like me to remove that commit?

@coveralls commented Mar 24, 2019

Coverage decreased (-0.04%) to 91.578% when pulling 724a0e6 on ackerleytng:master into 5016a00 on rasbt:master.

@rasbt (Owner) commented Mar 24, 2019

Hi there,

I really like the refactoring overall! I'm just a bit concerned about how it fits with other refactoring plans I really liked, i.e.,

#445 via @kota7

StackingEstimator (new, "abstract")
   |-- StackingRegressor
   |-- StackingClassifier
   |-- StackingCVEstimator (new, "abstract")
       |-- StackingCVRegressor
       |-- StackingCVClassifier

Not sure if this is being continued though.

Regarding replacing the arrays with Python lists: I'd actually prefer arrays over Python lists, but I think neither solution is ideal. We should know the exact size of the output array, so it would be more elegant/efficient to initialize an array of that size upfront and then just update it with the respective values.
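
A minimal sketch of that preallocation idea (hypothetical names and sizes, not mlxtend's code):

    import numpy as np

    n_samples, n_models = 6, 2                       # assumed sizes
    meta_features = np.empty((n_samples, n_models))  # allocate once, upfront

    # (test-row indices, predictions) per CV fold, as the CV loop would yield them
    fold_predictions = [
        (np.array([4, 5]), np.zeros((2, n_models))),
        (np.array([0, 1]), np.ones((2, n_models))),
        (np.array([2, 3]), np.full((2, n_models), 0.5)),
    ]

    for test_idx, preds in fold_predictions:
        # assign by row index: each fold's rows land in their original positions,
        # so no concatenation or reordering is needed afterwards
        meta_features[test_idx] = preds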

Regarding safe_indexing: that looks useful! I am just wondering, what would be a use-case scenario where issues with non-consecutive indices occur? Are there particular scikit-learn CV methods you know of that return arrays with non-consecutive indices? In either case, I think it's something we should do, because it doesn't hurt to add it :)

@ackerleytng (Contributor Author) commented:

safe_indexing is actually required before the data is passed to the first-level models: when the data comes in, there's no guarantee that it will be indexable starting from 0 (I added a test case that shows this).

The CV method then returns a bunch of randomized (and generally non-consecutive) indices, which we use to index the incoming data. Those indices are positional, between 0 and data.shape[0], so indexing the data directly, as the code did before, doesn't work.
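
For example (a sketch; the exact indices depend on the random seed):

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    X = np.zeros((10, 1))
    y = np.array([0, 1] * 5)
    skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
    train_idx, test_idx = next(skf.split(X, y))
    print(train_idx)  # e.g. [0 3 4 5 9]: 0-based positions, not consecutive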

Oh yes, you're right about initializing it upfront! I didn't think of that. In fact, that's awesome because we can assign by row index and won't have to do the reshuffling later.

Should I wait for the other refactoring effort, or would you like me to rework the current list-based implementation? Actually, I think some of the methods I refactored could be lifted into parent classes; the other effort is a great idea!

@rasbt (Owner) commented Mar 25, 2019

> Should I wait for the other refactoring effort, or would you like me to rework the current list-based implementation? Actually, I think some of the methods I refactored could be lifted into parent classes; the other effort is a great idea!

Good point. I'm actually not sure whether that other PR is still "active", and I'd say we probably shouldn't wait :). And like you said, these two things are not orthogonal; the refactoring into parent classes would probably also benefit from your refactoring.

> Oh yes, you're right about initializing it upfront! I didn't think of that. In fact, that's awesome because we can assign by row index and won't have to do the reshuffling later.

I think this would be the way to go then, instead of the list of lists you mentioned above, if that's not too much work/hassle.

Commit: "Also remove need to reorder labels when building meta features"

@ackerleytng (Contributor Author) commented:

@rasbt let me know if this is ok!

I didn't remove the hstack from predict_meta_features because it was only a single for loop building the list.

I'd have to wait till the first iteration before being able to get the number of classes to build the final array for the meta features, so the advantage of declaring the final array upfront is reduced.

@kota7 (Contributor) commented Mar 31, 2019

Hi, the refactoring in #445 currently has failing unit tests, and I have been away for a while.
I agree that you shouldn't wait. When I get a chance to fix the problems later, I will do so on top of whatever changes have been made to master by then.

@rasbt (Owner) commented Mar 31, 2019

> I didn't remove the hstack from predict_meta_features because it was only a single for loop building the list.

> I'd have to wait till the first iteration before being able to get the number of classes to build the final array for the meta features, so the advantage of declaring the final array upfront is reduced.

I think that's fine :). The function

    # inside StackingCVClassifier.predict_meta_features
    # (assumes: import numpy as np;
    #  from sklearn.utils.validation import check_is_fitted)
    check_is_fitted(self, ['clfs_', 'meta_clf_'])
    per_model_preds = []

    for model in self.clfs_:
        if not self.use_probas:
            # one column of class labels per model
            prediction = model.predict(X)[:, np.newaxis]
        else:
            # one column per class, per model
            prediction = model.predict_proba(X)

        per_model_preds.append(prediction)

    # stack all per-model columns side by side
    return np.hstack(per_model_preds)

is now so nice and simple thanks to the refactoring that it is very easy to read what's going on, which is a benefit as well. Later, when/if we refactor all stacking classes based on a BaseStacker, we can think about parallelizing this step. But for now, I think it is already a big improvement over the previous approach.

Unless you have something else you'd like to add, I'd be happy to merge.

@ackerleytng (Contributor Author) commented:

Please merge! Thanks so much! Thanks for maintaining mlxtend too!

@rasbt merged commit a3a539e into rasbt:master on Mar 31, 2019.