Path: news.service.uci.edu!csulb.edu!canoe.uoregon.edu!logbridge.uoregon.edu!netnews.com!nntp.abs.net!uunet!dca.uu.net!news.baymountain.net!not-for-mailFrom: Tim Peters <tim.one@comcast.net>Newsgroups: comp.lang.pythonSubject: RE: Newbie: finding the key/index of the min/max elementDate: Sun, 05 May 2002 05:22:04 -0400Organization: NoneLines: 136Message-ID: <mailman.1020590831.7588.python-list@python.org>NNTP-Posting-Host: mail.python.orgMime-Version: 1.0Content-Type: text/plain; charset=Windows-1252Content-Transfer-Encoding: 7BITX-Trace: news.baymountain.net 1020593547 11922 63.102.49.29 (5 May 2002 10:12:27 GMT)X-Complaints-To: abuse@baymountain.netNNTP-Posting-Date: 5 May 2002 10:12:27 GMTIn-reply-to: <74rA8.41449$8D3.1220071@news1.tin.it>X-MIMEOLE: Produced By Microsoft MimeOLE V6.00.2600.0000X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2911.0)Importance: NormalX-Priority: 3 (Normal)X-MSMail-priority: NormalX-Spam-Status: No, hits=-6.0 required=5.0 tests=IN_REP_TO,BODY_PYTHON_ZOPE version=2.11Errors-To: python-list-admin@python.orgX-BeenThere: python-list@python.orgX-Mailman-Version: 2.0.10 (101270)Precedence: bulkList-Help: <mailto:python-list-request@python.org?subject=help>List-Post: <mailto:python-list@python.org>List-Subscribe: <http://mail.python.org/mailman/listinfo/python-list>,	<mailto:python-list-request@python.org?subject=subscribe>List-Id: General discussion list for the Python programming language <python-list.python.org>List-Unsubscribe: <http://mail.python.org/mailman/listinfo/python-list>,	<mailto:python-list-request@python.org?subject=unsubscribe>List-Archive: <http://mail.python.org/pipermail/python-list/>Errors-To: python-list-admin@python.orgX-BeenThere: python-list@python.orgX-Original-To: python-list@python.orgXref: news.service.uci.edu comp.lang.python:199582[David Eppstein]> ...> I'm pretty sure it would usually be better just to sort the list and> return the middle item, instead of using a linear-time median> finding algorithm, despite the nonlinear big-O time of a sort.Many years ago I had to write a linear-time selector in Pascal.  It was sohorridly painful I never tried it again.  But I'm old enough now to confrontmy fears <wink>, so I tried it in Python, and was pleasantly surprised athow easily it went.  Its performance isn't bad, considering it's written inPython, and how much effort went into speeding Python's C sort.  Here arethe seconds each took for finding the median of various sizes of arrays ofrandom doubles (these are the means of 3 runs at each size):       n  clever    sort -------  ------  ------       2   0.000   0.000       3   0.000   0.000       5   0.000   0.000       9   0.000   0.000      17   0.000   0.000      33   0.000   0.000      65   0.001   0.000     129   0.002   0.000     257   0.002   0.000     513   0.004   0.001    1025   0.008   0.002    2049   0.018   0.004    4097   0.037   0.009    8193   0.088   0.020   16385   0.166   0.047   32769   0.303   0.106   65537   0.638   0.236  131073   1.276   0.528  262145   2.579   1.152  524289   5.245   2.523 1048577  10.674   5.460 2097153  22.063  11.863I got bored then.  This is using -O with current CVS Python, which issomewhat zippier than currently released Pythons.  For contrast, the Pythonversion doesn't take much longer to find the median than it takes just tofill the array with random doubles (random.random() is coded in Python too).There are obvious but unpleasant ways to speed the Python version (it does alot of data copying), which I'll skip.exorcising-old-demons-ly y'rs  - tim# Find the rank'th-smallest value in a, in worst-case quadratic time.def short_find(a, rank):    a.sort()    return a[rank - 1]# Find the rank'th-smallest value in a, in worst-case linear time.def find(a, rank):    n = len(a)    assert 1 <= rank <= n    if n <= 7:        return short_find(a, rank)    # Find median of median-of-7's.    medians = [short_find(a[i : i+7], 4) for i in xrange(0, n-6, 7)]    median = find(medians, (len(medians) + 1) // 2)    # Partition around the median.    # a[:i]   <= median    # a[j+1:] >= median    i, j = 0, n-1    while i <= j:        while a[i] < median:            i += 1        while a[j] > median:            j -= 1        if i <= j:            a[i], a[j] = a[j], a[i]            i += 1            j -= 1    if rank <= i:        return find(a[:i], rank)    else:        return find(a[i:], rank - i)def tryone(n, tries):    from random import random    from time import clock as now    x = [None] * n    median_rank = (n+1) // 2    sum1 = sum2 = 0.0    for i in range(tries):        for i in xrange(n):            x[i] = random()        y = x[:]        start = now()        got1 = find(x, median_rank)        elapsed = now() - start        sum1 += elapsed        start = now()        got2 = short_find(y, median_rank)        elapsed = now() - start        sum2 += elapsed        if got1 != got2:            print "OUCH got1 %r got2 %r" % (got1, got2)    return sum1, sum2def drive(tries):    for i in range(23):        n = 2**i + 1        fast, slow = tryone(n, tries)        print "%8d %7.3f %7.3f" % (n, fast/tries, slow/tries)if __name__ == "__main__":    drive(3)PS:  If you're more interested in expected time than worst-case time,replacing the median-of-median-of-7s business with        median = short_find([a[0], a[-1], a[n//2]], 2)yields a Python median-finder that's usually significantly faster than thesort method starting in the range of 512K to 1M elements.PPS:  If you do care about worst-case time, the Python version can be madesignificantly faster by boosting the median-of-7 gimmick to median-of-k forlarger odd values of k.  Fun for the whole family:  plot speedups against kuntil you're all baffled <wink>.