Issues using SMOTE #27

Closed
Ayyappatheegala opened this issue Jan 14, 2016 · 30 comments

Comments

@Ayyappatheegala

Ayyappatheegala commented Jan 14, 2016

Hi
First of all, thank you for providing us with this nice library.

I have an imbalanced dataset, which I've loaded using pandas.
When I supply the dataset as input to SMOTE, I get the following error:

ValueError: Expected n_neighbors <= n_samples,  but n_samples = 1, n_neighbors = 6

Thanks in Advance

@fmfn
Collaborator

fmfn commented Jan 16, 2016

What exactly are you inputting? Do you mind sharing the shape of the dataset and the distribution of each class?

The error message seems pretty clear, and indicates you don't have enough samples in your minority class (I'm guessing you are using the regular variation of SMOTE).
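
For anyone hitting the same error, a quick way to answer the question above is to count the samples per class before calling SMOTE. The sketch below uses only the standard library; `describe_classes` and the toy labels are hypothetical and not part of the package. It assumes the labels form a flat, 1-D sequence.

```python
from collections import Counter


def describe_classes(y):
    """Return (class_counts, smallest_class, smallest_count) for a 1-D label sequence."""
    counts = Counter(y)
    # The (label, count) pair with the lowest count identifies the rarest class.
    smallest_class, smallest_count = min(counts.items(), key=lambda item: item[1])
    return dict(counts), smallest_class, smallest_count


# Toy labels (made up for illustration): one class has a single sample.
y = ["maj"] * 8 + ["min"] * 3 + ["rare"]
counts, smallest_class, smallest_count = describe_classes(y)
print(counts)
print(f"smallest class: {smallest_class!r} with {smallest_count} sample(s)")
```

If the smallest count is at or below the number of neighbours requested, the nearest-neighbours step is the likely source of the ValueError.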

@Ayyappatheegala
Author

I'm trying to apply the SMOTE algorithm to my dataset, which consists of 93 minority and 250 majority class points.
The dimension of each vector is 67730,
i.e. the shape of the dataset is (96 * 67730) and (250 * 67730).

Are there any constraints or preconditions for using your library?

@diego898

same issue here

@fmfn
Collaborator

fmfn commented Feb 16, 2016

@Ayyappatheegala Sorry, I didn't see your message until today.

No constraints, at least that I know about. I don't think it will work with sparse data, and to be honest, given the dimensionality of your dataset, KNN (hence SMOTE) is likely to fail.

Regarding the error you are getting, it is hard for me to know what is happening from the ValueError alone. Perhaps you could share the full error message? The ValueError being raised is likely coming from scikit-learn; it might be due to the object misinterpreting the input data, or to the KNN object failing because of the extreme nature of the problem.

@Ayyappatheegala
Author

Hi Fernando,
The error has been solved for me.

It was again due to the format of the input supplied, as your package assumes all inputs to be numpy arrays.

@diego898: Please refer to issue #31.

@Pulkit-Khandelwal

ValueError: Expected n_neighbors <= n_samples, but n_samples = 1, n_neighbors = 6

I got the same error even though my input format was correct (numpy arrays). The error pops up because one of the classes does not have enough samples. Here, one of the classes has just one sample, hence the error. I resolved this by adding more samples to my dataset.

Also, I was trying to solve a multilabel problem (more than two classes). So, I used SMOTE c-1 times (where c is the number of classes).

@Ayyappatheegala @fmfn @diego898

@vappiah

vappiah commented Feb 14, 2018

Hi, I have a similar problem and would appreciate your suggestions to help resolve it. I am trying to solve a multiclass problem, but because of the class imbalance I want to use SMOTE. The number of instances in each class is listed below:

Class A 9
Class B 644
Class C 2
Class D 289

The error message is:
ValueError: Expected n_neighbors <= n_samples, but n_samples = 2, n_neighbors = 6

@glemaitre
Member

Internally, you cannot run a KNN search with 6 neighbours on a class containing only 2 samples.

@abautistah

So the point is that you need to have at least 6 samples in each class? I have the same error.

@vappiah

vappiah commented Feb 26, 2018

Thanks Everybody.

@glemaitre
Member

So the point is that you need to have at least 6 samples in each class? I have the same error.

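Yes. You can reduce the number of neighbours used via k_neighbors. The default is 5, as in the original paper; the 6 in the error message is that value plus one, because each sample is also counted as its own neighbour.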
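
To make the suggestion of reducing the number of neighbours concrete, here is a minimal sketch that derives the largest usable `k_neighbors` from the smallest class, using the class counts posted earlier in this thread. It assumes the current imbalanced-learn API (`SMOTE(k_neighbors=...)` and `fit_resample`; older releases used `fit_sample`). `max_k_neighbors` and `resample` are hypothetical helper names, not part of the package.

```python
from collections import Counter


def max_k_neighbors(y, default_k=5):
    """Largest k_neighbors that every class can satisfy, capped at default_k.

    A class needs k_neighbors + 1 samples, because the sample itself is
    returned as its own nearest neighbour.
    """
    smallest = min(Counter(y).values())
    if smallest < 2:
        raise ValueError("a class with a single sample has no neighbour to interpolate with")
    return min(default_k, smallest - 1)


def resample(X, y):
    # Imported lazily so the helper above works without imbalanced-learn installed.
    from imblearn.over_sampling import SMOTE

    smote = SMOTE(k_neighbors=max_k_neighbors(y), random_state=0)
    return smote.fit_resample(X, y)


# Class counts reported in this thread: A=9, B=644, C=2, D=289.
y = ["A"] * 9 + ["B"] * 644 + ["C"] * 2 + ["D"] * 289
print(max_k_neighbors(y))  # 1, because class C has only 2 samples
```

With a single neighbour, every synthetic point for class C lies on the segment between its two samples, so collecting more data is usually the better fix.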

@barthwalsamarth

barthwalsamarth commented Apr 24, 2018

My x_val has shape (28, 100) and y_val has shape (28, 1), where 28 is the number of records, 100 is the number of features, and 1 is the regression output for each record. But I still get the error "ValueError: Expected n_neighbors <= n_samples, but n_samples = 3, n_neighbors = 6".
What am I doing wrong? And is SMOTE also for regression, or only for classification? Below is what I am doing.
from imblearn.over_sampling import SMOTE
sm = SMOTE(ratio=0.5, random_state=10)
features_res, labels_res = sm.fit_sample(x_val, y_val)

@glemaitre
Member

@barthwalsamarth Apparently you have 3 samples in one of the classes but are asking for 6 neighbours.

@barthwalsamarth

@glemaitre But I don't have classes here. This is regression data: there are 100 features, all numeric, and one regression output. That makes me doubtful if SMOTE can be used with regression

@glemaitre
Member

That makes me doubtful if SMOTE can be used with regression

Actually, you are right. SMOTE is designed for classification problems, not regression. All the methods in this package are for classification. We could extend it to regression, but we would need to find the right API. The literature on regression is also thinner.
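
One plausible reading of the `n_samples = 3` error above, offered as an assumption rather than a confirmed diagnosis: if a continuous target is treated as class labels, every distinct value becomes its own tiny "class". The toy targets below are made up for illustration.

```python
from collections import Counter

# Hypothetical regression targets: mostly unique floats, with a few repeats.
y_val = [0.5, 1.25, 1.25, 1.25, 2.0, 3.75, 3.75, 4.1]

# Treating each distinct value as a label yields many classes with very few members.
pseudo_classes = Counter(y_val)
print(pseudo_classes)

# Pseudo-classes too small for the default of 5 neighbours (6 samples needed).
too_small = {value: n for value, n in pseudo_classes.items() if n < 6}
print(f"{len(too_small)} of {len(pseudo_classes)} pseudo-classes have fewer than 6 samples")
```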

@ghiander

Hello all,

I have read the full thread: I have converted my data into numpy array and used k_neighbors=1, but the algorithm still raises the following error:

Expected n_neighbors <= n_samples, but n_samples = 1, n_neighbors = 2

It looks like it does not care about how many neighbors I decide to use to construct my synthetic samples.

@glemaitre
Member

Expected n_neighbors <= n_sample

should be clear enough :)

But be careful, because having n_samples=1 is really a corner case for which I am not sure the algorithm will give anything useful.

@ghiander

I used k_neighbors=1. Why does the algorithm say: Expected n_neighbors <= n_samples, but n_samples = 1, n_neighbors = 2?

@ghiander

This is not about the setup I am using (which is for testing the library), rather about the effective way the algorithm works.

@glemaitre
Member

I used k_neighbors=1. Why does the algorithm say: Expected n_neighbors <= n_samples, but n_samples = 1, n_neighbors = 2?

This error is raised by the nearest-neighbours search of scikit-learn, where the parameter is called n_neighbors, and not k_neighbors or m_neighbors, which are specific to SMOTE.

We could actually catch the error and raise a more appropriate one with the specific naming. PRs welcome.

Recalling the documentation:

  • k_neighbors is the number of nearest neighbours used to construct synthetic samples.
  • n_neighbors is the number of nearest neighbours to use to determine if a minority sample is in danger.

Therefore, if the nearest-neighbours search is given a single sample at fit time and has to find the 5 nearest neighbours (which are not there, since we have a single sample), it leads to the given error.
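
The idea of raising a more specific error could look roughly like the following pre-check. This is a sketch of the idea, not the library's code: `check_k_neighbors` is a hypothetical helper, and it assumes that a class needs `k_neighbors + 1` samples, as explained further down in this thread.

```python
from collections import Counter


def check_k_neighbors(y, k_neighbors=5):
    """Raise a SMOTE-specific error before the nearest-neighbours search would fail."""
    required = k_neighbors + 1  # the query sample counts as one of the neighbours
    for label, n_samples in Counter(y).items():
        if n_samples < required:
            raise ValueError(
                f"class {label!r} has {n_samples} sample(s), but k_neighbors={k_neighbors} "
                f"requires at least {required}; lower k_neighbors or add samples"
            )


y = [0] * 10 + [1] * 2
try:
    check_k_neighbors(y)  # default k_neighbors=5 needs 6 samples per class
except ValueError as err:
    print(err)

check_k_neighbors(y, k_neighbors=1)  # passes: 2 samples are enough for 1 neighbour
```

Naming the class and the SMOTE parameter in the message avoids the confusion between `n_neighbors`, `k_neighbors`, and `m_neighbors`.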

@glemaitre
Member

This is not about the setup I am using (which is for testing the library)

Actually this is important because you need to fulfill some minimum assumptions required by the algorithm.

@ghiander

I think your answer is not correct:

m_neighbors : int or object, optional (default=10)
If int, number of nearest neighbors to use to determine if a minority sample is in danger.
The algorithm raises an error by calling the parameter n_neighbors, as you wrote. You recalled the documentation, but you confused m_neighbors with n_neighbors.

Secondly, I can definitely tell you that even if I set k_neighbors=1 and m_neighbors=1, the algorithm still raises the same error: Expected n_neighbors <= n_samples, but n_samples = 1, n_neighbors = 2

I am just trying to understand what exactly is not working.

@glemaitre
Member

The algorithm raises an error by calling the parameter n_neighbors, as you wrote. You recalled the documentation, but you confused m_neighbors with n_neighbors.

Nope, I did not misunderstand it. n_neighbors refers to the number of neighbours within the scikit-learn NN algorithm, which can be either m_neighbors or k_neighbors in SMOTE. If you share the traceback, I can tell you exactly which one of the NN searches is actually failing.

Secondly, I can definitely tell you that even if I set k_neighbors=1 and m_neighbors=1, the algorithm still raises the same error: Expected n_neighbors <= n_samples, but n_samples = 1, n_neighbors = 2

Off the top of my head, the NN search used in scikit-learn returns the sample itself as its nearest neighbour, which we are not interested in, and therefore the number of neighbours is increased by one.

But still, I really think that having a single data point is really a corner case.
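
The extra neighbour described above can be reproduced with a tiny brute-force nearest-neighbour search. This is a stand-in written for illustration, not scikit-learn's implementation; `nearest_indices` is a hypothetical helper that mimics only the size check and the fact that a training point is its own closest neighbour.

```python
import math


def nearest_indices(points, query, n_neighbors):
    """Indices of the n_neighbors points closest to query (brute force)."""
    if n_neighbors > len(points):
        raise ValueError(
            f"Expected n_neighbors <= n_samples, but n_samples = {len(points)}, "
            f"n_neighbors = {n_neighbors}"
        )
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], query))
    return order[:n_neighbors]


minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
k_neighbors = 1

# Querying with a training point returns that point first (distance 0) ...
print(nearest_indices(minority, minority[0], k_neighbors))          # [0]
# ... so k_neighbors + 1 are requested and the first hit is dropped.
print(nearest_indices(minority, minority[0], k_neighbors + 1)[1:])  # [1]

# With a single sample, k_neighbors=1 therefore asks for 2 neighbours and fails.
try:
    nearest_indices(minority[:1], minority[0], k_neighbors + 1)
except ValueError as err:
    print(err)
```

The final call fails with the same message reported above for `n_samples = 1, n_neighbors = 2`.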

@ghiander

ghiander commented Jun 27, 2018

Off the top of my head, the NN search used in scikit-learn returns the sample itself as its nearest neighbour, which we are not interested in, and therefore the number of neighbours is increased by one.

Ok, I think that this is the actual answer. So, the neighbors' value is augmented by one, and for this reason, it raises that error. Otherwise, I did not understand why the algorithm was complaining about n_neighbors=2, if I did not specify that anywhere.

Thanks!

@SqrtPapere

ValueError: Expected n_neighbors <= n_samples, but n_samples = 1, n_neighbors = 6
Also, I was trying to solve a multilabel problem (more than two classes). So, I used SMOTE c-1 times (where c is the number of classes).

@Ayyappatheegala @fmfn @diego898

Are you sure that you need to apply it c times? I just passed a multi label X and Y in input and the output was new_X, new_Y with all the classes with same number of occurrences!

Maybe they just updated the package to support multi class?

@glemaitre
Member

This issue was closed and the comment is outdated. The documentation of SMOTE mentions that multiclass is supported in a one-vs-rest manner (automatically).
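
As an illustration of what automatic multiclass handling means for the output, the sketch below computes how many synthetic samples each class would need if every non-majority class is raised to the majority count, which matches the equal class sizes reported above. It does not call the library; `synthetic_needed` is a hypothetical helper, and raising every class to the majority size is assumed to be the default resampling target.

```python
from collections import Counter


def synthetic_needed(y):
    """Synthetic samples per class when all classes are raised to the majority size."""
    counts = Counter(y)
    target = max(counts.values())
    return {label: target - n for label, n in counts.items() if n < target}


# Class counts reported earlier in this thread: A=9, B=644, C=2, D=289.
y = ["A"] * 9 + ["B"] * 644 + ["C"] * 2 + ["D"] * 289
print(synthetic_needed(y))  # {'A': 635, 'C': 642, 'D': 355}
```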

@pracaas

pracaas commented Apr 17, 2019

Can I use SMOTE for a multiclass classification problem?

@glemaitre
Member

Can I use SMOTE for a multiclass classification problem?

yes

@ballcap231

ballcap231 commented Oct 14, 2019

That makes me doubtful if SMOTE can be used with regression

Actually, you are right. SMOTE is designed for classification problems, not regression. All the methods in this package are for classification. We could extend it to regression, but we would need to find the right API. The literature on regression is also thinner.

Is there any intention to make this SMOTE package capable of working for regression in the near future? It seems to have already been implemented in R.
https://rdrr.io/cran/UBL/man/smoteRegress.html

@glemaitre
Member

Not for the moment.

@scikit-learn-contrib scikit-learn-contrib locked as resolved and limited conversation to collaborators Oct 16, 2019