first nonzero element (Trac #1673) #2269

Open
numpy-gitbot opened this issue Oct 19, 2012 · 33 comments

Comments

@numpy-gitbot

Original ticket http://projects.scipy.org/numpy/ticket/1673 on 2010-11-13 by trac user tom3118, assigned to unknown.

The "numpy for matlab users" suggests using
nonzero(A)[0][0]
to find the index of the first nonzero element of array A.

The problem with this is that A might be a million elements long even though its first nonzero element sits right at the start: nonzero(A) still scans the entire array and builds the full index list.

This is an extremely common operation. An efficient, built-in method for this would be very useful. It would also ease people's transition from Matlab, in which find is so common.
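
For illustration, a minimal sketch contrasting the documented idiom with a pure-Python early exit (the array and values are just an example):

import numpy as np

A = np.zeros(1_000_000)
A[3] = 7.0

# Documented idiom: nonzero() scans the whole array and materializes
# every matching index, even though only the first one is wanted.
first = np.nonzero(A)[0][0]                                # -> 3

# Pure-Python early exit: stops at the first match, but runs at
# interpreter speed rather than NumPy speed.
first_lazy = next(i for i, v in enumerate(A) if v != 0)    # -> 3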

@numpy-gitbot
Author

trac user tom3118 wrote on 2010-11-13

A related use case is:
filter(test,A)[0]
in which either A is long or test is expensive.
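
(As a side note: in Python 3, filter is lazy, so a per-element early exit can be written as the sketch below; it stops at the first match but evaluates the predicate at interpreter speed, which is exactly what a vectorized first is meant to avoid.)

import numpy as np

A = np.arange(10) - 5
test = lambda v: v > 0   # stand-in for an expensive predicate

# filter() returns an iterator in Python 3, so this stops at the first
# element that passes the test; None is the fallback if nothing matches.
first_match = next(filter(test, A), None)   # -> 1 (the value, not the index)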

@numpy-gitbot
Author

@rgommers wrote on 2011-03-24

It doesn't have to be only the first nonzero; finding the first occurrence of any value would be useful.

@numpy-gitbot
Author

@rgommers wrote on 2011-03-24

As noted in #2333, the meaning is unambiguous for 1-D. For >1-D semantics are up for discussion.

Perhaps a keyword that determines the order of iteration over axes would work. Or it could simply be left undefined for >1-D.

@numpy-gitbot
Author

trac user lcampagn wrote on 2011-07-09

I have seen many requests for a find_first in numpy, but most of these requests have subtly different (and incompatible) requirements such as "find first value less than x" or "find first nonzero value". I suggest the following function specification:

  ind = array.find(x, testOp='eq', arrayOp='all', axis=0, test=None)
  arguments:
    x       -> value to search for
    testOp  -> condition to test for ('eq', 'ne', 'gt', 'lt', 'ge', 'le')
    arrayOp -> method for joining multiple comparisons ('any' or 'all')
    axis    -> the axis over which to search
    test    -> for convenience, this may specify a function to call to perform
               the test. This is not expected to be efficient.
  returns: 
    first index where condition is true (or test returns true, if given)
    or None if the condition was never met

If the array has ndim > 1, then tests are performed using normal broadcasting rules.
So for example, if I have an array with shape (2,3), the following would be valid:

  ## find first row with all values=0
  array.find(0, testOp='eq', arrayOp='all', axis=0)
  ## equivalent to:
  for i in range(array.shape[axis]):
    if (array[i] == 0).all():
      return i

  ## find first column with any element greater than its corresponding element in col
  col = array([1,2])
  array.find(col, testOp='gt', arrayOp='any', axis=1)
  ## equivalent to:
  for i in range(array.shape[axis]):
    if (array[:,i] > col).any():
      return i
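
A hedged, pure-Python sketch of a standalone-function version of the proposed signature for 2-D input (the name find_first, the _OPS table, and the axis handling are illustrative assumptions, not an agreed API):

import operator
import numpy as np

_OPS = {'eq': operator.eq, 'ne': operator.ne, 'gt': operator.gt,
        'lt': operator.lt, 'ge': operator.ge, 'le': operator.le}

def find_first(arr, x, testOp='eq', arrayOp='all', axis=0, test=None):
    # Compare each slab along `axis` against x (or apply `test`), combine
    # the element-wise result with all()/any(), and return the first index
    # where the combined condition holds, or None if it never does.
    compare = test if test is not None else (lambda slab: _OPS[testOp](slab, x))
    combine = np.all if arrayOp == 'all' else np.any
    for i in range(arr.shape[axis]):
        slab = np.take(arr, i, axis=axis)
        if combine(compare(slab)):
            return i
    return None

a = np.array([[1, 2, 3], [0, 0, 0]])
find_first(a, 0, testOp='eq', arrayOp='all', axis=0)   # -> 1 (second row is all zeros)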

@pelson
Contributor

pelson commented Mar 5, 2013

As I needed this functionality the other day, I had a good look into this and was convinced that a C solution was needed to get a suitably quick result. However, a chunking approach written in Python has proved to be suitably quick for my case, and much more flexible to boot.

import numpy as np
from itertools import chain


def find(a, predicate, chunk_size=1024):
    """
    Find the indices of array elements that match the predicate.

    Parameters
    ----------
    a : array_like
        Input data, must be 1D.

    predicate : function
        A function which operates on sections of the given array, returning
        element-wise True or False for each data value.

    chunk_size : integer
        The length of the chunks to use when searching for matching indices.
        For high-probability predicates a smaller number will make this
        function quicker; similarly, choose a larger number for low
        probabilities.

    Returns
    -------
    index_generator : generator
        A generator of (indices, data value) tuples which make the predicate
        True.

    See Also
    --------
    where, nonzero

    Notes
    -----
    This function is best used for finding the first, or first few, data values
    which match the predicate.

    Examples
    --------
    >>> a = np.sin(np.linspace(0, np.pi, 200))
    >>> result = find(a, lambda arr: arr > 0.9)
    >>> next(result)
    ((71, ), 0.900479032457)
    >>> np.where(a > 0.9)[0][0]
    71


    """
    if a.ndim != 1:
        raise ValueError('The array must be 1D, not {}.'.format(a.ndim))

    i0 = 0
    chunk_inds = chain(range(chunk_size, a.size, chunk_size), [None])

    for i1 in chunk_inds:
        chunk = a[i0:i1]
        for inds in zip(*predicate(chunk).nonzero()):
            yield (inds[0] + i0, ), chunk[inds]
        i0 = i1

In [1]: from np_utils import find

In [2]: import numpy as np

In [3]: import numpy.random    

In [4]: np.random.seed(1)

In [5]: a = np.random.randn(1e8)

In [6]: a.min(), a.max()
Out[6]: (-6.1194900990552776, 5.9632246301166321)

In [7]: next(find(a, lambda a: np.abs(a) > 6))
Out[7]: ((33105441,), -6.1194900990552776)

In [8]: (np.abs(a) > 6).nonzero()
Out[8]: (array([33105441]),)

In [9]: %timeit (np.abs(a) > 6).nonzero()
1 loops, best of 3: 1.51 s per loop

In [10]: %timeit next(find(a, lambda a: np.abs(a) > 6))
1 loops, best of 3: 912 ms per loop

In [11]: %timeit next(find(a, lambda a: np.abs(a) > 6, chunk_size=100000))
1 loops, best of 3: 470 ms per loop

In [12]: %timeit next(find(a, lambda a: np.abs(a) > 6, chunk_size=1000000))
1 loops, best of 3: 483 ms per loop

I'll pop this on the dev-mailing list, but if there is sufficient interest I'd be happy enough to turn it into a PR.

Cheers,

@sachinruk

I know this is 3 years late, but is this included in numpy now? Coming from a MATLAB background, this function seems really important to me. A PR would be highly appreciated (not that I'm one of the developers).

@tgy

tgy commented Mar 28, 2017

I would also be interested in this.

@toobaz

toobaz commented Apr 4, 2017

Maybe it's obvious, but since it was not mentioned: np.all() and np.any() would probably be even easier (and unambiguous for dimension > 1) to make lazy. Currently...

In [2]: zz = np.zeros(shape=10000000)

In [3]: zz[0] = 1

In [4]: %timeit -r 1 -n 1 any(zz)
1 loop, best of 1: 3.52 µs per loop

In [5]: %timeit -r 1 -n 1 np.any(zz)
1 loop, best of 1: 16.7 ms per loop
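
For illustration, a short-circuiting version of np.any for 1-D arrays can be approximated in pure Python with chunking (a sketch only; names and chunk size are arbitrary):

import numpy as np

def any_lazy(a, chunk_size=4096):
    # Check the array a chunk at a time and stop at the first chunk that
    # contains a nonzero value, instead of reducing the whole array.
    for start in range(0, a.size, chunk_size):
        if a[start:start + chunk_size].any():
            return True
    return False

zz = np.zeros(10_000_000)
zz[0] = 1
any_lazy(zz)   # -> True after inspecting only the first chunk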

@toobaz

toobaz commented Apr 4, 2017

(sorry, I had missed the ref to #3446 )

@roebel

roebel commented May 31, 2017

As I have been searching for an efficient solution to this problem for quite a while, and as there seem to be no concrete plans to support this feature, I have tried to come up with a solution that is not quite as complete and versatile as the API suggested above (for the moment it supports only 1-D arrays), but that has the advantage of being written completely in C and therefore seems rather efficient.

You find the source and the details here:

https://pypi.python.org/pypi?name=py_find_1st&:action=display

I would be grateful for any comments on the implementation, notably on the somewhat astonishing performance issue, described on that PyPI page, that arises when passing in boolean arrays and searching for the first true value.

@lzkelley
Contributor

I found this via a Stack Exchange post asking for this feature, which has been viewed over 70k times. @roebel did you ever get any feedback on this? Can you just put in a PR for the feature, which might get more attention?

@roebel

roebel commented Sep 14, 2018

No, I never got any feedback, but a few people have apparently used the package without problems.
BTW, for Anaconda on Linux and macOS I have made an Anaconda installer:

https://anaconda.org/roebel/py_find_1st

With respect to a PR, I will have to look into the effort it would require to adapt this so that it can be merged easily into numpy. I won't have the time to fight through discussions about API changes and extensions.

@jmlarson1

jmlarson1 commented Nov 16, 2018

Does the removal of "priority:normal" mean that this important feature will somehow be given less attention?

@mattip
Member

mattip commented Nov 16, 2018

The priority is still "normal", just without a label. The issue needs a champion to actually make a PR and push it through the approval process including documentation and hopefully a benchmark.

@mhvk
Contributor

mhvk commented Nov 16, 2018

Probably useful here to point to #8528, which is nominally about all_equal but can be seen as implementing parts of this. Indeed, in #8528 (comment), @ahaldane explicitly suggests implementing a first reduction method on all comparison operators instead of a new gufunc all_equal.

This also means there is quite a bit of an implementation waiting to be adapted (though it is not a trivial change from a gufunc to a new reduction method, and there is the question of whether we want a new method on all ufuncs, even those for which first makes little sense).

@jmlarson1

This issue has been known since (at least) 2012. Any update on a way to prevent nonzero(A)[0][0] from searching all of A?

@yunyoulu

yunyoulu commented Nov 7, 2019

Is it the so-called Pythonic way to always scan all the elements?

@eric-wieser
Member

eric-wieser commented Nov 8, 2019

@yunyoulu: It's the ufunc way. Let's take a step back and look at the general process of a multi-step computation in numpy, and the number of passes it takes:

  1. np.argwhere(x)[0] - performs 1 pass of the data
  2. np.argwhere(f(x))[0] - performs 2 passes of the data
  3. np.argwhere(f(g(x)))[0] - performs 3 passes of the data

One option would be to introduce a np.first function or similar - that would then look like the following, where k <= 1 varies depending on where the first element is:

  1. np.first(x)[0] - performs 0+k passes of the data
  2. np.first(f(x))[0] - performs 1+k passes of the data
  3. np.first(f(g(x)))[0] - performs 2+k passes of the data

The question to ask here is - is this saving really worth that much? Numpy is fundamentally not a lazy computing platform, and making the last step of a computation lazy isn't particularly valuable if all the previous steps were not.

@jmlarson1

jmlarson1 commented Nov 8, 2019


@eric-wieser

I don't think that's worded quite right. If k = 10 for some problem, it's not 1+10=11 passes of the data for np.first(f(x))[0]

(edited out by @eric-wieser for brevity, this conversation is already too long)

The use case where I most see the need for this functionality is when A is a large tensor with A.shape = (n_1, n_2, ..., n_m). In such a case, np.first(A) would require looking at only k elements of A instead of n_1*n_2*...*n_m (a potentially significant savings).

@eric-wieser
Member

eric-wieser commented Nov 8, 2019

I most see the need for this functionality is when A is a large tensor

Presumably in this case you've already done at least one full pass of the data - so at best you're getting code that runs twice as fast.

@bersbersbers

bersbersbers commented Nov 9, 2019

The question to ask here is - is this saving really worth that much? Numpy is fundamentally not a lazy computing platform, and making the last step of a computation lazy isn't particularly valuable if all the previous steps were not.

That's an interesting point of view that, if established, could be used to justify trashing almost all efforts to improve computational performance "because we are computing something else as well and that is still slow". (It's the same argument used by deniers of climate change action: well, until this other country does something, doing something in our country won't help anyone.) I am not convinced at all. If there is any chance of speeding up some part of a computation by a factor of 1/k, with k potentially very, very small, that is worth it in my opinion.

Also, when working interactively (Jupyter etc), very often you do the "passes" of the data in separate cells, so you could end up speeding up a whole cell as well.

@toobaz

toobaz commented Nov 9, 2019

np.first(f(x))[0] - performs 1+k passes of the data

@eric-wieser indeed when I looked at this issue in 2017 I was really hoping it would be the first step towards a sort of np.firstwhere(x, array_or_value_to_compare), which is indeed a specific case - but important in my experience - of f(x).

@eric-wieser
Member

eric-wieser commented Nov 9, 2019

@toobaz: I assume you have f = lambda x: x == value_to_compare in that example.

This is exactly the reason that I am wary of going down this path at all (cc @bersbersbers). If you're not careful, we end up with (spellings speculative):

  1. np.first(x) - save a pass vs nonzero
  2. np.first_equal(x, v) - save a pass vs first(np.equal(x, v))
  3. np.first_square_equal(x*x, v) - save a pass vs first_equal(np.square(x), v)

It should be pretty obvious that this doesn't scale at all, and we have to draw the line somewhere. I am slightly in favor of 1 being allowed, but 2 being allowed is already an explosion of API surface area, and 3 seems very unwise to me.

One argument in favor of np.first - if we implement it, numba could special-case it such that np.first(x*x == v) within a numba context actually does do a single pass.
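
As an illustration of that last point, an explicit early-exit loop under numba already achieves the single pass today; a minimal sketch, assuming numba is available (the function name and the -1 sentinel are arbitrary choices for the example):

import numpy as np
from numba import njit

@njit
def first_square_equal(x, v):
    # Fused predicate evaluated element by element: one pass, early exit,
    # and no temporary arrays for x*x or the boolean mask.
    for i in range(x.shape[0]):
        if x[i] * x[i] == v:
            return i
    return -1  # sentinel: no element matched

x = np.arange(10.0)
first_square_equal(x, 49.0)   # -> 7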

@npexciton

Anyway, it's good to know that it's impossible to do lazy things in numpy, which clarifies the current status of the issue.

However, I don't feel comfortable when performance tweaks are judged only on how they scale.

Let's ask a simple question: are personal computers scaling today? The answer is definitely no. Three years ago a standard laptop came with 8 GB of memory; today you'll still find 8 GB on the market. However, software uses 2x or 4x more memory than it used to. At least workstations are not scaling the way clusters are.

Making a function 10x slower without changing its complexity at all is enough to drive a data scientist crazy. What's worse, there is nothing elegant they can do even once the bottleneck has been identified through profiling.

What I'm trying to say is that the ability to do lazy processing is always desirable and can be crucial to the responsiveness of the system as well as to the productivity of the people using the language. The difficulty or workload of implementing these features in the library is a perfectly understandable reason for not doing so, but don't say they are not useful.

@toobaz

toobaz commented Nov 18, 2019

@toobaz: I assume you have f = lambda x: x == value_to_compare in that example.

Correct

This is exactly the reason that I am wary of going down this path at all (cc @bersbersbers). If you're not careful, we end up with (spellings speculative):

1. `np.first(x)` - save a pass vs nonzero

2. `np.first_equal(x, v)` - save a pass vs `first(np.equal(x, v))`

3. `np.first_square_equal(x*x, v)` - save a pass vs `first_equal(np.square(x), v)`

I understand your concern, but I would never ask for np.first_square_equal, just as I would never ask (and nobody, I hope, asked) for np.square_where. And yes, I do see that 3. means making a full pass of the data. But the array is created once, and I might need to look for many different values in it. For instance (coming back for simplicity to example 2.), I want to verify whether all of my 30 possible categories appear in my 10^9-item array, and I strongly suspect they all appear among the first 10^3 elements.

So first let me clarify my previous comment: I would like np.firstwhere(x, array_or_value_to_compare) as a function which meets my intuition, but the computational issues I had back in 2017 would have been solved even just with np.first.

Second, the point is, I think, not just the running time of a single call. It is true that I need to make a full pass of the data anyway to do 2. and 3... but maybe I already did that pass when I initialized the data, and now I'm really looking for a way to speed up a frequent operation.

I see your point that np.first really deviates from the standard numpy approach, and I see it could be nontrivial to implement well... what I don't see is how it would "infect" the rest of the API, or grow a large API of its own.

That said, if you instead think it really is beyond numpy's scope, maybe there is room for a small independent package.

@roebel

roebel commented May 17, 2020 via email

@ez2rok

ez2rok commented Dec 29, 2021

I think it would also be useful to not just return the first nonzero element but perhaps the first n non-zero elements.

@jmlarson1

I think it would also be useful to not just return the first nonzero element but perhaps the first n non-zero elements.

Yes, it would be useful to get the first n nonzero elements. But even just the n=1 case would be very nice.
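
A chunked sketch for the first-n case, along the lines of the generator posted earlier (the function name and chunk size are illustrative, not a proposed API):

import numpy as np

def first_n_nonzero_indices(a, n, chunk_size=4096):
    # Collect indices of nonzero elements chunk by chunk and stop as soon
    # as n of them have been found; returns fewer if the array runs out.
    found = []
    for start in range(0, a.size, chunk_size):
        idx = np.nonzero(a[start:start + chunk_size])[0]
        found.extend((start + idx[:n - len(found)]).tolist())
        if len(found) >= n:
            break
    return found

a = np.zeros(1_000_000)
a[[10, 20, 99999]] = 1.0
first_n_nonzero_indices(a, 2)   # -> [10, 20]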

@WarrenWeckesser
Member

FYI: A while back I added the gufuncs first and argfirst to ufunclab.

@NelsonUpenn

NelsonUpenn commented Jan 15, 2023

Very sorry, but a newbie would like advice on how to install argfirst. I went to ufunclab on GitHub, downloaded everything, and tried running
$ python setup.py -install
with Anaconda Python 3.8.1 installed. This did not seem to generate any ufunclab.py, so when I later tried
$ python ufunclab/tests/test_first.py
I got ModuleNotFoundError: No module named 'ufunclab'
Surely I am missing something obvious.

@WarrenWeckesser
Member

@NelsonUpenn, ufunclab is a personal project of mine, and is not affiliated with the NumPy project. If you have questions about it, you can create an issue at https://github.com/WarrenWeckesser/ufunclab.

@jasondet

jasondet commented Jan 30, 2023

If one expects that one of the early elements is nonzero, one can find it semi-efficiently with a while loop that calls np.where() on slices of increasing length:

import numpy as np

n = 1
indices = np.array([], dtype=int)
while n - 1 < len(A):
    # search the half-open window [n-1, 2*n-1), doubling its size each pass
    indices = np.where(A[n-1:2*n-1] != 0)[0]
    if len(indices) > 0: break
    n = n << 1
if len(indices) > 0:
    print(f'first nonzero at index {n-1 + indices[0]}')

@NelsonUpenn

NelsonUpenn commented Jan 30, 2023 via email
