Talk:Binary search algorithm

From Wikipedia, the free encyclopedia

Contents

[edit] hey, no checking if it's there?

I don't think this code accounts for what might happen if you search for something that's not there. I am not sure though.

[edit] Example doesn't work

The guessed number might be 11 or 12! So better to ask an additional question, or change the comparison to "greater or equal".

[edit] Definition Clarification?

http://www.nist.gov/dads/HTML/suffixarray.html

The reason I bring this up is that the opening definition of binary_search doesn't include a clear definition of "that characteristic" used by the "sort algorithm". The characteristic used would be the order of the data, or the orderability of the data. [E.g. a "recent changes" page can be ordered alphabetically, or chronologically. This type of data has two inherently sequential properties, alphabetical and chronological, both linear arrays.] I'm neither a programmer nor a mathematician....=/ Suffix array is something special, too. more on that

Posted in wiki: In computer science, binary search or binary chop is a search algorithm for finding a particular value in a list of data. A divide and conquer algorithm, binary search requires random access to the data being searched, the ability to quickly get the kth item in the list. In its simplest form, binary search assumes the data is sorted (usually by a sort algorithm) and takes advantage of that characteristic.


http://www.nist.gov Definition: Search a sorted array by repeatedly dividing the search interval in half. Begin with an interval covering the whole array. If the value of the search key is less than the item in the middle of the interval, narrow the interval to the lower half. Otherwise narrow it to the upper half. Repeatedly check until the value is found or the interval is empty. -links to- [is an example of a:]

dichotomic search definition: Search by selecting between two distinct alternatives (dichotomies) at each step. 


Data Structures, Algorithms Binary Search Algorithm

This Binary Search Algorithm uses recursive subdivisions of an array to perform a search. Enter the size of the array in the text field, and click the "Create" button. The array is created with randomly generated numbers. Now, input a search key in the text field, and click the "Search" button. http://www.cosc.canterbury.ac.nz/people/mukundan/dsal/BSearch.html


These two definitions use the term ARRAY. Answers.com clears this up: ARRAY:

  1. Mathematics.
      1. A rectangular arrangement of quantities in rows and columns, as in a matrix.
      2. Numerical data linearly ordered by magnitude.
    • Ieopo 04:03, 28 October 2005 (UTC)
      • An array doesn't have to be ordered. An array is a data structure, pretty much like linked lists and queues are. The array exists before a sorting algorithm is applied to it (how could it otherwise be ordered?). Moreover, nothing says you need to apply such a sorting algorithm for it to be an array. So the definition of array shouldn't include order. There exists the term ordered array, though. --Agcala 16:51, 1 April 2007 (UTC)


  • As an encyclopedic definition it is somewhat obscure. I think it hasn't been stressed enough that the array must be sorted in ascending key order, or the algorithm won't work.--Agcala 16:51, 1 April 2007 (UTC)

[edit] Pseudo code alternate version

  // Alternate version:
  BinarySearch(A, value) {
                                            // pre: 0 < N and A is ascending
      i = 0                                 // invariants:
      j = N                                 // 0 <= i < j <= N and f.
      while (i + 1 != j) {                  // 0 <= i + 1 < j <= N
          h = (i + j) / 2                   // 0 <= i < h < j <= N
          if (A[h] <= value)        
              i = h                         // 0 <= i < j <= N  
          else
              j = h                         // 0 <= i < j <= N
      }                                     // 0 <= i < j <= N and i + 1 = j
      return A[i] == value                 
  }

Only one exit point, and overflow problems are not a problem for pseudocode. Some might argue there is a downside: it doesn't return the index of the found item (for which the algorithm is easily adapted by returning not_found or the index i), it doesn't work with empty arrays (for which the algorithm is easily adapted by returning not_found), and it is less efficient in that it continues searching the array after the item has been found (though it is MORE efficient in a worst-case scenario; just count the number of times a comparison is made). Question: not_found is not a value, and is usually -1, but this is only specific to the C language.

Overflow might be fixed with h = i/2 + j/2 + (i mod 2)*(j mod 2).
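The identity above can be checked mechanically. A minimal C sketch (midpoint is a name invented here for illustration, not from any posted code):

```c
#include <assert.h>

/* Sketch of the overflow-free midpoint suggested above:
 * i/2 + j/2 + (i mod 2)*(j mod 2) equals the truncated (i + j)/2
 * for 0 <= i <= j, but never forms the sum i + j, which could
 * overflow for indices near the top of the integer range. */
int midpoint(int i, int j)
{
    return i / 2 + j / 2 + (i % 2) * (j % 2);
}
```

The extra product term compensates for the rounding lost when both halves are truncated: it is 1 exactly when i and j are both odd.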

[edit] Pseudo code change

I changed the pseudocode since it was incorrect. dze27


Pseudocode:

function binary-search(L,V)
  set start = 1
  set end = N
  repeat while start <= end
    set middle = (start + end) div 2
    if V = L[middle]
      return success    
    else-if V < L[middle]
      set end = middle - 1 
    else-if (V > L[middle])
      set start = middle + 1
    end-if
  end-repeat
  return failure
end-function

Notes: 
 div is integer division (discard any remainder)
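For readers who want to see the pseudocode above in a real language, here is one possible C rendering. It is a sketch under two assumptions the pseudocode leaves open: 0-based indexing, and returning the index on success or -1 on failure.

```c
/* A sketch in C of the pseudocode above, adapted to 0-based indexing:
 * returns the index of value in the sorted array a[0..n-1], or -1 for
 * failure.  The midpoint is computed as start + (end - start)/2, which
 * equals (start + end) div 2 but cannot overflow. */
int binary_search(const int a[], int n, int value)
{
    int start = 0;
    int end = n - 1;
    while (start <= end) {
        int middle = start + (end - start) / 2;
        if (value == a[middle])
            return middle;              /* success */
        else if (value < a[middle])
            end = middle - 1;           /* continue in the lower half */
        else
            start = middle + 1;         /* continue in the upper half */
    }
    return -1;                          /* failure */
}
```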

In practice, the non-recursive algorithm would be implemented with some minor changes: Robocoder 15:02, 12 November 2005 (UTC)

  • To avoid overflowing (e.g., left+right > maxint), calculate:
mid := floor(left+(right-left)/2)
  • Searching the top half of the array is implied and does not require the conditional test:
if value > a[mid]
In pseudocode we assume that integers are infinite-precision, for the sake of simplicity, but you are correct. Similarly, the redundant conditional test exists solely for clarity. Pseudocode is designed for people to read. Deco 21:53, 12 November 2005 (UTC)

[edit] Bridging ideas

Has anyone else noticed that the part with the example of guessing numbers between 1-16 and the part with the pseudocode etc. aren't very well connected? Someone who doesn't already know what binary search is, and how it works, might not make the connection.

Anyone have any ideas how to express how the _algorithm_ for 'guessing' the right number in the 'game' can be used to find the position of a certain number in a list?

Good point. I tried to add some transition and tie them in somewhat, but they're still a bit separated. I hope this helps. Deco 07:27, 26 Dec 2004 (UTC)

The pseudocode is pretty intense for me - it seems more complex than it may need to be. I think the number guessing game should include a formula* for finding the smallest number of steps required to resolve any unknown number. [The "50 in 6 steps" example should be complemented with a "100 in 7 steps" example to give a sense of the power of the algorithm.]

  • Formula? I can't find the formula or make one - I keep coming up with more code than math [e.g. if NOT real number then]
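For what it's worth, the formula asked for above is steps = ceil(log2(N)), assuming each question is a yes/no comparison ("is it greater than x?"). A small C sketch (max_guesses is an invented name) that computes it with integer arithmetic, avoiding floating-point edge cases:

```c
/* max_guesses computes ceil(log2(n)) with integers: the number of
 * doublings needed before a span of 1 candidate grows to cover all n,
 * i.e. the worst-case number of yes/no guesses in the guessing game. */
int max_guesses(int n)
{
    int steps = 0;
    int span = 1;            /* how many numbers `steps` guesses can resolve */
    while (span < n) {
        span *= 2;           /* each extra guess doubles the span */
        ++steps;
    }
    return steps;
}
```

This reproduces the examples discussed: 50 needs 6 steps, 100 needs 7, and the 1-16 game needs 4.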

[edit] Real-world programming language example

I would as well appreciate an implementation form in a real-world programming language -- HJH

Wikipedia isn't a code repository. The pseudocode sufficiently explains the algorithm. Please search and insert an external link. — 131.230.133.185 01:37, 13 July 2005 (UTC)
At first I was quite alarmed by this change, mostly because the reason used to make the edit could be applied to unilateral deletion of code samples from all articles, which would be very bad and highly controversial. But in this particular case, I believe you're right — the samples are almost identical to the pseudocode in form and function. No need for redundancy. Deco 04:12, 13 July 2005 (UTC)

[edit] Recursive Change

I changed the code for the recursive version because it wouldn't have worked as written. The "return mid" statement would have returned the value to one of the recursive calls rather than the initial, external caller. 69.171.89.130 21:48, 4 February 2006 (UTC)

Right you are. Good catch. Deco 23:42, 4 February 2006 (UTC)

[edit] Pseudocode examples will not work

Hi, I think that the algorithm will not work if programmed as described in the pseudocode examples. The value at the right index is never checked for equality. Therefore you can never find the value 'b' in the list 'a','b'.

I think you're incorrect. Because we add 1 to mid during each recursive call, it must continue to get larger with each call, so eventually the range will only contain "right". Deco

It's possible to write this algorithm in a way that avoids special-case testing, and confusing indexing with a +1 here and perhaps not there. It would also be good to perform a single comparison at each stage (the actual comparison might be via function evaluation) though few computer languages enable this. Suppose an array A, with elements 1 to N to be searched to find an index i such that A(i) = X. In a pseudocode fragment, converted from a working programme written in accordance with Professor Knuth's version,

        L:=0;                             %Establish outer bounds.
        R:=N + 1;                         %One before, and one after, the first and last.
  Probe:P:=(R - L)/2;                     %Truncate to integer. Beware integer overflow with (L + R)/2.
        if P <= 0 then Return(NotFound);  %Aha! Nowhere!
        P:=P + L;                         %Probe the middle of the span L:R.
        Case Sign(X - A(P))               %Perform the comparison once.
Positive:L:=P, go to Probe;               %Shift the left bound up:    X follows A(P).
Negative:R:=P, go to Probe;               %Shift the right bound down: X precedes A(P).
    Zero:Return(P);                       %So, X is found, here!       X equals A(P).
        Otherwise;                        %No other cases are possible.

Now then, this is not C, it is pseudocode, so be brave. What I mean by Case Sign(expression) is to consider the sign of the expression's value, which has three cases. In most languages you are stuck with something like if expression > 0 then ... else if expression < 0 then ... else ...; or some such permutation of the three cases. This means that on average, half the time two comparisons will be performed at each stage, not one. But in Fortran, you can write IF (expression) negative,zero,positive meaning go to the appropriate label for the three cases, and obviously, the expression is written (and hopefully, compiled to execute) once only.

Notice that there is no special case testing for annoyances such as N = 0 or N = 1, or X being outside the bounds of A(1) to A(N), which are particularly annoying to check if N = 0. These cases are all encompassed in the one method, and it is very easy to make mistakes with half-remembered details. Further, this code contains usage of the deprecated go to statement, but I invite those who recoil to present a revision that avoids wasted effort or replicated code and blather splattered across the page. NickyMcLean 21:07, 13 December 2006 (UTC)
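Taking up the invitation above, here is one possible goto-free C rendering of the pseudocode. It is a sketch only: chop is an invented name, the array is taken as 1-based ints (a[0] unused), and an if/else-if chain stands in for the three-way Case, so the comparison may execute twice per iteration.

```c
/* Goto-free C sketch of the pseudocode above: bounds l and r sit one
 * outside the search range, and the span shrinks until it is empty.
 * Returns the index of x in a[1..n], or 0 for not found.  Handles
 * n = 0, n = 1, and x outside a[1]..a[n] with no special cases. */
int chop(const int a[], int n, int x)
{
    int l = 0;
    int r = n + 1;               /* one before, and one after, the range */
    for (;;) {
        int p = (r - l) / 2;     /* beware (l + r)/2: it can overflow */
        if (p <= 0)
            return 0;            /* span exhausted: nowhere! */
        p += l;                  /* probe the middle of the span l:r */
        if (x > a[p])
            l = p;               /* x follows a[p]: shift left bound up */
        else if (x < a[p])
            r = p;               /* x precedes a[p]: shift right bound down */
        else
            return p;            /* x equals a[p]: found here */
    }
}
```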

Regarding NickyMcLean's broken changes from 19:33, 15 February 2007: A is a sorted list with range 1 to N. Consider a single item list (i.e. N=1): low will be initialised to 0, high to 2 and the first probe, p, to 1. That is the only valid index in the list, and yet if that doesn't match the loop won't end. Your next set of values are either high=1, low still 0, and p=0 (out of bounds fault), or low=1, high still 2, and p still 1 (an infinite loop). The while condition MUST be (low < high-1) for this form of the algorithm to work. j4_james 22:55, 15 February 2007 (UTC)

Yep, you're right, your fix for the C-pseudocode was correct and I misread it. In the above pseudocode the test remains the simple P <= 0 (to stop) rather than low < high - 1 (to continue) in the C-pseudocode formulation. Apologies for the burp in my skull sludge. NickyMcLean 00:00, 16 February 2007 (UTC) PS: Neat timestamp! And after I had lunch (2pm in NZ) I remembered how I coded the binary chopper in assembler.

I see that the C-pseudocode is still unstable, as Mariolj claims an error fixed if the sought item is not in the list. But after my previous mistake, I'm not going to look again! And all the time, the above pseudocode remains compact and correct, and is a slightly-edited copy from an actual working prog. which has worked correctly when confronted by a search for a value not in the list. But the pseudocode is not in the C-style... NickyMcLean 04:08, 21 February 2007 (UTC)

I can assure it's now broken again. There are two common forms of the algorithm - one with the low and high bounds outside the test range (start with low = 0 and high = N+1) and one with the bounds inside (start with low = 1 and high = N). The problem is that people tend to confuse the two which is how you end up with the disastrous hybrid that we have now. It's easy to prove that it's broken with a simple walk through of two elements. I couldn't care less what format the pseudocode is in, but it would have been nice if it worked. j4_james 19:15, 21 February 2007 (UTC)

[edit] Integer Overflow

Prompted by MuthuKutty the remark "Beware integer overflow" can be expanded upon without adding further blather to the proffered pseudo-C. When writing pseudocode, it is usually supposed that any variable is capacious enough to encompass the largest value or greatest precision necessary for the method so that messy details will not obscure the technique, but in actual implementations there are limits, and they are close. Suppose 16-bit signed integers are used: clearly the indexing of the array will work only up to it having 32,767 elements. Alas, the calculation of P:=(R + L)/2; can fail long before that, because the sum can easily exceed the 16-bit limit as when the array has more than 16384 elements and the area being probed is close to the high end of the array. This can be fixed (almost) by making the expression slightly larger, and accepting the extra execution time: P:=(R - L)/2 + L; (This order has hope of slightly faster execution than L + (R - L)/2, since no temporary storage is needed to hold the value of L while the (R - L)/2 part is being computed)

But only almost. Suppose the array has indeed 32,767 elements. Now remember that the initial values of L and R are to be one outside the bounds, so R must be able to hold 32,768, and it can't in 16-bit signed integer form. Similarly, suppose that unsigned integers are to be used: all well and good, and the same problem exists with a maximal upper bound. But suppose that the array indexing starts with zero (as in C, by requirement) rather than one for whatever reason. The initial value of L is to be one less than the first index, and -1 is not a possible value for an unsigned integer.
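The failure mode described above can be demonstrated concretely. A sketch using 16-bit unsigned arithmetic, where the wraparound is at least well defined (the function names are invented for illustration):

```c
#include <stdint.h>

/* With bounds in the upper half of a large array, l + r exceeds 65535:
 * the naive midpoint wraps modulo 2^16 to a uselessly small value,
 * while (r - l)/2 + l never forms the sum and stays correct. */
uint16_t naive_mid16(uint16_t l, uint16_t r)
{
    return (uint16_t)((uint16_t)(l + r) / 2);   /* (l + r) mod 65536, halved */
}

uint16_t safe_mid16(uint16_t l, uint16_t r)
{
    return (uint16_t)((r - l) / 2 + l);         /* never forms l + r */
}
```

For l = 40000 and r = 60000, the naive form yields 17232 (below the lower bound!) while the safe form yields the correct 50000.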

A further possibility exists. Most computer languages do not offer a protocol for a three-way test on the result of a comparison so the code is usually some lame repetition of the comparison such as

  if A(P) < X then L:=P
   else if A(P) > X then R:=P
    else Return(P)

Suppose instead this is avoided via

  diff:=A(P) - X;
  if diff < 0 then L:=P
   else if diff > 0 then R:=P
    else Return(P)

With the hope that the compiler will optimise the repeated access of variable diff using some sort of temporary register. (A compiler is very unlikely to optimise the repeated comparison code, not least because the code is not exactly repeated; one has < while the other has >) Alas, if the comparison is performed on variables with limited capacity, such as integers, this can fail because the difference overflows the integer limit as in (for 16-bit) 30000 - (-15000). The explicit comparison code will (should!) succeed, because the compiler writer will (should!) have taken advantage of such machine indicators as overflow and carry which were set by the subtract operation, which states are not carried along into the value of variable diff. In the absence of a three-way test syntax, only assembler programming will produce good code. NickyMcLean 20:46, 12 February 2007 (UTC)
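For completeness, C does allow a three-way result without ever forming the difference: the idiom (a > b) - (a < b) yields -1, 0, or +1 using only comparisons, so the 30000 - (-15000) overflow described above cannot occur. A sketch (cmp_int is an invented name):

```c
/* Three-way comparison without subtraction: each relational test is 0
 * or 1, so the result is -1, 0, or +1 and no overflow is possible,
 * unlike the diff := a - b formulation discussed above. */
int cmp_int(int a, int b)
{
    return (a > b) - (a < b);
}
```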

Nicky, look up bsearch() in the C Standard Library. There is no compare function. There is only a function pointer to a compare function. If you want to compare strings then use strcmp() and it will return 1, -1, 0. This exact same set of values must be returned by any comparison function called through the function pointer specified to bsearch() as a function having two void pointers as its argument.
int CmpInt( const void *pvKey, const void *pvArrayMem )  // void pointers, per the bsearch() prototype
{
    const int *piKey = pvKey;
    const int *piArrayMem = pvArrayMem;
    if (*piKey < *piArrayMem) return -1;  // any negative value is sufficient/legal
    if (*piKey > *piArrayMem) return +1;  // any positive value is sufficient/legal
    return 0;
}
// for searching arrays of structs use a compare function like this
int CmpMyStruct( const void *pvKey, const void *pvArrayMem )  // void pointers, per the bsearch() prototype
{
    const int *piKey = pvKey;
    const MyStruct *pArrayMem = pvArrayMem;
    if (*piKey < pArrayMem->structmem) return -1;  // any negative value is sufficient/legal
    if (*piKey > pArrayMem->structmem) return +1;  // any positive value is sufficient/legal
    return 0;
}
As you can see, there is no "annoying non-uniformity" between the comparison of the simple data types, including char arrays(strings), and complex, user-defined data types defined via struct{}s. There are much more clever ways to search struct{}s, such as using offsetof() to move the array's base pointer over so the compare function can dispense with the need to deal with the ->structmem offset - in which case the original simple datatype compare function would work just fine.
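To make the discussion concrete, here is a minimal, compilable use of the standard library bsearch() with a sign-correct comparator. The names contains and cmp_int_v are invented for this sketch; the prototype and semantics of bsearch() itself are those of the C standard library.

```c
#include <stdlib.h>

/* Comparator matching the prototype bsearch() requires: two const void
 * pointers in, a negative/zero/positive int out. */
int cmp_int_v(const void *pa, const void *pb)
{
    int a = *(const int *)pa;
    int b = *(const int *)pb;
    return (a > b) - (a < b);   /* -1, 0 or +1; any sign-correct value is legal */
}

/* Returns 1 if key is present in a small sorted table, else 0. */
int contains(int key)
{
    static const int sorted[] = {2, 3, 5, 7, 11, 13};
    return bsearch(&key, sorted, sizeof sorted / sizeof sorted[0],
                   sizeof sorted[0], cmp_int_v) != NULL;
}
```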
The notion of passing a function as a parameter is not troublesome. More interesting would be to supply an expression as a function parameter. But I offer two points: firstly, I wasn't restricting myself to talking only about C (and its variants), so there is not a lot of point in extolling whatever useful attributes some variant of C might offer in this context (which is not restricted to C-languages); and second, look at your remarks: "any negative value" (or positive value) is allowed. Well, the whole point of my argument is that the result of a comparison should be usable in a three-way test, as in case(Compare(a,b)), where the recognised cases have to be explicitly the values -1, 0, and +1, otherwise the case-statement fails.
You seem to have totally forgotten or confused the context of a compare function where bsearch() is concerned. The return value and function arguments MUST conform to the function prototype specified by bsearch(), and the test you seem to be referring to in case(Compare(a,b)) is being done INSIDE the bsearch() library-supplied function. This is not something you are writing nor have the ability to modify. Furthermore, the reason that -1,0,+1 are NOT specified as the only legal return values for a compare function is precisely so you can write a compare function that returns the element number (or your mother's birthday for that matter) if you wish to do so, provided you managed the sign to also satisfy bsearch()'s requirements of a compare function. I have no idea how you think you are going to embed such interpolation code INSIDE of the library-supplied bsearch() call, but your comments seem very confused in general. My guess is you are a student and have little real-world practice to call on. I've tried to be helpful, but really, your comments are largely disconnected, out of context, unrelated and random streams of consciousness.

--Solidpoint 00:43, 3 April 2007 (UTC)


It may be that C (and other languages) might offer range-like conditions for case-matching but this is now a second-level embedment in the vagaries of C alone. I fully agree that a comparison routine should return -1,0,+1 appropriately, however those interested in interpolative searching would be keen on a numeric value for the size of the difference. Thus the usage would be Case(Sign(Compare(a,b))) in the binary search, and Compare would have to be a routine able to compare many (all?) types of data without change of name. NickyMcLean 21:51, 1 April 2007 (UTC)


Nicky, you are sure making my point about why high-level, abstract languages suck. They tend to make you forget what is going on "under the covers". First, if you really knew anything about C or assembly language you would know to just examine the sign bit in the compare and be done with it. Duh! Second, a binary search, such as bsearch(), is not an interpolation search, and I don't care what the requirements are for that since that is not the problem being solved here. You seem intent on making some point about a ternary test where none is required. C in fact produces perfect or very near perfect assembly language for compare functions, which it should since they are silly simple, so you are burning a straw man. Finally, bsearch(), using exactly the same syntax, does allow you to search any type of data, but you are so tangled up with C++ and god knows what that you keep missing the point. In most cases the calling function simply specifies the appropriate compare function pointer value so that when bsearch() is called in that function's context it already "knows" which one to call. Simple and elegant, and the pointer value can be embedded right in the bsearch() statement, so no endlessly distracting C++ lines of code are needed to "set up" the call. As for this page, C does have an advantage in being a very sparse syntax, so there would not be a lot of language-specific knowledge needed for the reader to get the gist of how a binary search is implemented.
With that caveat, I tend to agree that a definition of a binary search should not be language specific. In theory it could have uses other than those in software, but as a practical matter software is the dominant use, and there are already good implementations of this functionality in most languages, including C, so the only thing left to discuss where C and bsearch() are concerned is the compare function. As for the way that C return values are handled in non-C languages, it's a problem that can never be manifest. If you actually sit down and write a C program that can search 5 different data types, including a few struct{}s, you won't need to read this to "get it". In its defense, the great thing about C is that it is a WYSIWYG language. No need to spend hours or days trying to figure out what the heck is going on "under the covers" or why something is taking forever to execute. In C it's right there staring you in the face the whole time. Advantage C. --Solidpoint 00:17, 3 April 2007 (UTC)

--Solidpoint 08:33, 31 March 2007 (UTC)

Nicky, only a real dipshit programmer would deal with data range issues down at this level in the code. Any seasoned programmer will trap those errors at the data entry or file read level and never, ever would let stuff 3-9 levels down like this deal with those issues. I also have to say that I usually read the assembler that the compiler spits out (or library calls) when I write something this low level, and the compiler does produce good assembly language - at least in C. I'm just back from the gym and will have to read your compare function, but in general, it is handled by a function pointer in C, and the compare function, which can be arbitrarily complex btw, should test for greater, then less than, and default to equal, as it is the least likely case. Using return statements in the compare function is very efficient because you don't have to exit the function in a uniform way; you can just jump back to the caller with the value supplied.

I'm not too clear what you are on about. The reference http://googleresearch.blogspot.com/2006/06/extra-extra-read-all-about-it-nearly.html by J. Bloch clearly describes the problem. Large collections being searched soon exceeded the span of 16-bit integers, but no problem, everyone relaxed into 32-bit integers which are so big that no-one would ever have a collection of more than a thousand million items to be searched. So, in order to save a trivial action, (L + R)/2 was favoured over (R - L)/2 + L in whatever high-level language was used, and this has its effects all the way down to bit flipping and will cause the high-level system to fail. As for the comparison of keys, again the wording of the high-level language has its effects. In this case, if if-statements are to be used, the binary result of what should be a ternary condition forces a wasteful repetition. As I understand it, C does offer a strcmp function that returns a three-state result, excellent, it may be used in a case statement or, via diff:=strcmp(a,b) the result may be inspected via two if-statements without repetition of the comparison process itself. However, strcmp is for comparison of text strings only. In principle, any ordered datum should be comparable, as is offered by syntax involving < and >, but they only offer binary comparison. Thus there is the annoying situation whereby one syntax can be used for many types (integers of various sizes, floating point of various sizes) but another must be used for string variables. One could imagine a compare(a,b) function that deals with whatever types a and b present and returns a ternary state. This is what I was trying to imply with sign(a - b), granting a generalisation of subtraction in this context. 
In the absence of such a multi-type comparison function (hopefully compiled with attention to context), various different formulations might be tried in a high-level language, with potential problems as described.NickyMcLean 23:01, 27 March 2007 (UTC)

I'm not sure who this should be addressed to, but recursive functions work just fine on Intel processors, which have dedicated hardware for building and destroying stack frames. Also, since the whole point of the binary search is that you don't need very many probes, the stack frame will never get very deep. If you are building a database engine you might want to optimize with a loop, but the performance hit for recursive functions is not that great. I would encourage you to write a "do nothing" function that calls itself and benchmark it - assuming the compiler isn't so smart as to optimize it into a loop anyway, you will have a better feel for how much performance is suffering. --Solidpoint 07:18, 10 March 2007 (UTC)

[edit] Where does the name come from?

Why is this called a "binary" search? What's binary about it? Is it that binary means "can have two values" and here the key can be in one of two places (below or above the midpoint)? Or something about splitting the array in two, and two == binary? Seems pretty weak, but I can't think of a better reason. --Lkesteloot 04:02, 28 June 2006 (UTC)

It's because it splits the array in two. That's really it. Deco 04:29, 28 June 2006 (UTC)

[edit] This page needs some serious work

Would anybody object if I gave this page a thorough working over? There are so many ideas mixed together here, and many which are missing, or left unexplained, that, on the whole, it's just pretty awful. I've been a professional software engineer and programmer for 25 years or so now and wrote a database engine a few years ago where I used this algorithm extensively. I already know what this thing is, so I don't really need a page on Wiki to tell me about it, but I am willing to have a go at fixing it so long as you are all willing. --Solidpoint 07:07, 10 March 2007 (UTC)

Why not? (Presumably you mean the article, not the discussion) - I've tried to present a simple and clear pseudocode, but its non-C nature offended the eyes of some C-lovers and was promptly converted to (incorrect) C, followed by a series of fixes to the mistakes that others then altered by further fixes that created errors (and I bungled one too, alas), and I've given up trying to check that today's version of the pseudocode actually does perform a binary search of elements 1 to N of an array as declared in the text, that it reports "not found" as appropriate, that it succeeds even when N = 0 or 1 as well as more sensible ranges, and that it doesn't fail for large N because it doesn't use (L + R)/2. Good luck. NickyMcLean 22:19, 27 March 2007 (UTC)

Nicky, Having thought about the scale problem of (L+R)/2 I think this is a serious problem that should be addressed - a good catch. Before I tackle this I need to get an old machine plugged in, up, and running and don't have time for that right now. I think it would be great to have a good C code here, and one that works. I could also just cut and paste the commented code from Microsoft's C library, at least if I could get permission. A better approach might be to describe a test data set that will quickly test for known failure modes. This kind of low-level functionality really should be written in C or assembly language so we should step up to the plate and get a good C function going. --Solidpoint 07:54, 31 March 2007 (UTC)

Actually, it was MuthuKutty who triggered my memory. I recall using the L + (R - L)/2 form and being criticised for it by fellow students who advocated (L + R)/2 because it was obviously faster. Since we were all using a system with 39-bit integers the overflow problem was not near and I was encouraged to not worry about something that couldn't happen. On a different system I later got away with using the (L + R)/2 form with 16-bit integers, because the arithmetic was being conducted in 32-bit registers. At the time I was quite conscious of the array size because the compiler at first would not allow arrays to be declared with a bound larger than 32767 (so the maximal array allowed had bounds -32767:32767) even though 32-bit arithmetic was being used. NickyMcLean 21:30, 4 April 2007 (UTC)

[edit] Having a go at it

Well, I decided I could at least make a dent in this, so I started at the beginning. If someone can tell me why the page is in Purgatory I will leave it alone. Otherwise this thing is just too ugly to suffer upon the world. :D --Solidpoint 01:52, 3 April 2007 (UTC)

[edit] moron's view point

Say I have a huge list, and even if I do know the size of the list, one would generally match patterns in the list.

Take this case: I have 1 million names on a list and I want to find "bob" in it. I just go around looking for "b" randomly; if I see nothing but lots of z---g in front of me, I get to g, go down, and so on. In other words, would simply hitting random numbered indices far apart be worse off than going down by 1/2?

So having a second index helps, does it not?

One could also divide the set N into 3 equal parts and keep dividing that further into 3 equal parts, and so on. Think of a square and you need to find a point in the square: you can split the square into 2 equal parts and keep splitting till you find the point, or you can split the square into 3 equal parts and keep splitting further, so you get N/3, N/9, N/27 and so on. Much better than N/2, N/4, ...?

—The preceding unsigned comment was added by 220.227.207.194 (talk) 11:08, 4 April 2007 (UTC).

Well, better would imply you knew the cost of these additional entanglements. My gut tells me that if you did, the performance would be awful. Back in the day we spent an enormous amount of time working these things out, so I doubt there are any big payoffs lying around to be scooped up by a noob. It's a lot of fun trying though, if you have the stomach for a lot of disappointment. If you can write a solid, self-balancing Red-Black tree you will have a much better idea of what motivates my comments here.--Solidpoint 05:06, 13 April 2007 (UTC)


http://en.wikipedia.org/wiki/Ternary_search so ternary exists too and seems to be faster than binary..


If you have a million items in a sorted list, the binary search process will report a result of found here, or not found, in no more than twenty probes because 2^20 > 1,000,000. For large lists, the storage may well be of slow access (a disc drive) either as an explicit disc file, or via virtual memory. In this case, it might be worthwhile to keep a small second index table of a dozen elements or so (small enough to be in fast memory) to be searched first to find a good starting range for the binary chop. There are all manner of variations possible in general, that might suit a particular case especially well. But, for an unchanging list, the initial probes of the binary search will always be to the same elements (the first is always at index N/2, then the second probe will be at either N/4 or 3N/4, etc.) and the disc file buffering scheme (or virtual memory controller) will be keeping frequently-accessed data in fast memory. So, let it do so and avoid the bother of replicating its activity with regard to your data.

A random starting point might better be replaced by starting where a previous search succeeded, or you might want to consider the Interpolation search as a possible process. Splitting the span into thirds is good, and that would perhaps be called a ternary chop or ternary search, but, how is your method going to select the right third to proceed with? With a test at the 1/3 element and the 2/3 element? But in two tests, the binary search will reduce the span by a factor of four, which is better than a factor of three. Anyway, this article is about the binary search. NickyMcLean 21:14, 4 April 2007 (UTC)


Well, some thoughtful ideas, but I can tell you from real-world experience that the binary search's KISS approach is savagely fast. Most of its speed is achieved right at the start, where it is discarding huge numbers of list entries in a few steps. Imagine, for example, that you had a list with every person on earth's id on it. In the very first split it would rule out over 3 billion list members! Microsoft's lib code for C actually ends the bsearch() by executing an lsearch() as the tail end of a binary search isn't very effective. All kinds of approaches can be construed, but if you really want to evaluate new ideas you have to build a benchmark and find out how many clock cycles it takes a reference processor to execute your idea once distilled down to code. A good example is the Interpolation search. It has a ton of intellectual appeal, but I have never been able to get it to perform well in the real world on even favorable data sets. Hash tables used to rule, but they perform badly when they hit real-world hot spots like a list of names with "Smith" or "Jones" or "Gomez" in them.

Your Smith and Jones analogy makes no sense whatsoever in this context. Binary search has been mathematically proven to be inferior; try it on a sheet of paper in front of you, if not on an x86 computer. It is the architecture of the computer that limits you today. You never look at half the list, etc.; you always tend to look at random points, even when looking up words in a dictionary. It would be crazy to look half way into the dictionary when looking for a word. Now, if you had 3 registers to do a cmp with, you would tend to look at 3 points; with a register window, life would be fantastic. —The preceding unsigned comment was added by 220.227.207.194 (talk) 06:37, August 23, 2007 (UTC)

Personally, I keep wanting to put a search decision matrix up on this page to tell noobs when to use what kind of search or associative data structure.--Solidpoint 05:06, 13 April 2007 (UTC)

This would belong at search algorithm, not here. Disk-based searches do use more sophisticated algorithms; see database index. Dcoetzee 20:10, 11 May 2007 (UTC)

[edit] Don't delete the recursive implementation

It appears that someone decided long ago to delete the recursive implementation, which I consider illustrative, and in languages with tail-recursion elimination is even efficient. I'm going to add this back. This is not mainstream C on mainstream platforms, it's an article explaining how an algorithm works - stop microoptimizing, please. Dcoetzee 20:54, 2 May 2007 (UTC)

I also moved a bunch of details regarding three-way comparisons to their own article, three-way comparison, eliminated irrelevant differences between the implementations like variable naming and (some) syntax, and hope to condense this article more in the future. Dcoetzee 21:36, 2 May 2007 (UTC)

[edit] Problem with the code example

I have changed one of the code examples:

 high = mid; /* Should be high=mid and not high=mid-1 */

is not correct; it should be

 high = mid - 1;

The original code could enter an infinite loop - consider a one-element array (low = 0, high = 0) where A[0] equals value.

The new version can be proved to be correct; I am not sure whether it would be appropriate to include the proof here.

ICrann15 08:35, 14 July 2007 (UTC)

[edit] We should always test our code

I was so hoping that you were right about high=mid-1 because it would mean it was slightly more efficient. But unfortunately it isn't correct.

The previous code line

   high = mid;

(which I've just restored it to) never did enter an infinite loop and it has been thoroughly tested. With your code (high=mid-1), having an array of [0, 1, 2] and searching for 1 didn't work. (Forgive me if my tests or testing code is wrong -- that would be really bad!)

Here is the correct code (in the form I'm using):

   low = 0;
   high = length - 1;
   while (low < high) {
       mid = (low + high)/2;
       if (A[mid] < value)
           low = mid + 1;
       else
           high = mid;
   }
   if (A[low] == value)
       return low;
   else
       return -1;

So a one-element array would make low = 0 and high = 0, so the while-loop would be completely skipped and no infinite loop would occur.
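For what it's worth, the deferred-detection loop above can be wrapped into a compilable C sketch together with the regression cases discussed in this thread (the function name and test values are mine, not from the article):

```c
#include <assert.h>

/* Deferred-detection binary search, as in the fragment above: the loop
   narrows [low, high] down to a single index, and equality is tested
   exactly once at the end. Returns the index of value, or -1 if absent.
   Assumes length >= 1. */
int binary_search(const int A[], int length, int value)
{
    int low = 0;
    int high = length - 1;
    while (low < high) {
        int mid = (low + high) / 2;
        if (A[mid] < value)
            low = mid + 1;
        else
            high = mid;
    }
    return (A[low] == value) ? low : -1;
}
```

Searching for 1 in [0, 1, 2] lands on index 1, and a one-element array skips the loop entirely, so neither case can spin forever.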

I have added an interesting section (interesting, but not because of me -- I don't intend to sound pompous) called "Correctness and Testing", which explains how difficult it is to correctly code a binary search: most if not all programmers have done it incorrectly at some point. I've added my own testing code that thoroughly tests a binary search. It probably should have been added a long time ago.

I do like the extra advantages that you pointed out, though.

For anyone reading this, let's try to make this a better world (especially the programming world) by using the scientific method and proving things, and not just passing around code we get from somewhere else. It would be great if all of the incorrect binary searches (and such) were tested, forgotten, and annihilated, but incorrect versions have always been around (at least since 1946 when binary search first appeared, according to Robert L. Kruse's book "Data Structures and Program Design in C++", page 280, as cited on the binary search wiki page). I searched several different web sites for "binary search", and almost all of them were copy-and-pastes of what's on wikipedia, and yet they weren't tested. Let's not take everything at face value. It seems we are drowning in information but are somewhat thirsty when it comes to actual facts. I believe in proof and testing! --75.165.249.107 10:38, 15 July 2007 (UTC)

Unfortunately, "proving" the correctness of a method's implementation merely shows a direct connection between the assumptions and the result. Less obvious are the implicit assumptions. I fully agree with the need for tested code, but there seems to be nothing that prevents people from making all manner of fiddles that turn out to fail. Much of the time, I suspect that the algorithm is in error, but I've lost patience with inspecting the latest change. Any corrections are soon enough uncorrected. The tests you mention above are good (it is always good to try the null cases, here zero elements to search, or one), but the trial fails to check for the problem first raised by MuthuKutty, namely the common use of (Left + Right)/2, which in 32-bit arithmetic works fine for all anticipated array sizes, until that is the size increases to 2,000,000,000 elements and integer overflow is possible, though not certain: 64-bit arithmetic might be in use in the computation even though 32-bit variables are involved. Just as 32-bit integer arithmetic can be done even though 16-bit variables are in use. In other words, it is implicitly assumed that overflow doesn't occur. Alas, exhaustive (brute force) testing is rarely possible. NickyMcLean 21:18, 15 July 2007 (UTC)
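The (Left + Right)/2 overflow hazard mentioned above has a standard remedy: compute the midpoint as an offset from the lower bound. A minimal C sketch (the function name is mine):

```c
#include <assert.h>
#include <limits.h>

/* Midpoint of [low, high] computed without the (low + high) sum that can
   overflow a 32-bit int once indices approach 2 billion. The subtraction
   (high - low) stays in range whenever 0 <= low <= high. */
int midpoint_safe(int low, int high)
{
    return low + (high - low) / 2;
}
```

With indices around two billion, low + high would wrap a 32-bit int, while the offset form does not.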
NickyMcLean, you are right, but I have only been concerned about the correctness of the actual algorithm and not the computer architecture or platform; I think they are kind of out of the scope of the binary search algorithm itself, though it is important to check. Other threads might mess it up, or overflow may occur, but before that is worried about, one should verify that the basic algorithm is correct. By the way, my testing code purposely started with an array of length=1 instead of length=0; I thought the "new" operator didn't work for 0-length arrays, but it looks like it does. It looks like you would have to throw exceptions or make sure your index didn't access a 0-length array.
I added the "Correctness and Testing" section in the hopes that people would carefully consider and test the page's code and their own code before making rash decisions and wrongfully changing the page's code. It is possible that some code may be absolutely correct under certain conditions but incorrect under other conditions (e.g. the addition of threads, or overflow), and some people might change what is actually good code because it doesn't work with their conditions; if that happens, I recommend that the codes be verified given certain assumptions/conditions, then perhaps adding a separate set of code for those certain conditions. Perhaps a new section should be added for threads, overflows, etc., and directions on how to test binary search with those conditions, but that is not my expertise. Someone else do it, please.
I also wish that twiddlers would test, but alas they don't, in part because the supplied code is not exactly in their favoured language, and, even if it were, all that is supplied is a code fragment with no example testing code such as you advocate, so preparing a test is inconvenient. In the Fortran world, a long-established practice is to present a subroutine that does something (such as compute cubic splines) along with a commented-out main programme that invokes it with known data and tests for the correct results. Simply by removing the comment markers, a test programme was ready to be compiled and run. This was especially relevant when, to use a published programme, it would have to be typed in, and idiot mistypes would be overlooked. I well recall entering a fast Fourier transform with factorisation (thus not just N = power of two), and in the proofreading with a colleague I deliberately misread some statements and he continued to say "yes", "yes". When presenting pseudocode for the odd wiki article I'm moved to contribute to, I have always taken an actual working programme, and then reshaped it lightly into pseudocode to hide the language-specific drivel and bring out the algorithm. This results in non-C code that others immediately revise into C, and while they're at it, errors are often introduced as variant forms are mixed together.
With regard to the binary chop, the case N = 0 should indeed work, to save special initialisation code when accumulating a list. Thus,
if NotFound then AddNewEntry;        

rather than

if n <= 0 then PlaceFirstEntry          
 else if NotFound then AddNewEntry.

NickyMcLean 21:16, 16 July 2007 (UTC)
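The point about N = 0 can be sketched in C: if the search reports where a missing key would belong, an empty list needs no special-case code at all. The -(insertion point) - 1 return convention is borrowed from Java's Arrays.binarySearch; the names here are mine:

```c
#include <assert.h>

/* Binary search over a half-open range [low, high); works for n == 0.
   Returns the index of value if present, otherwise -(insertion point) - 1,
   so a caller can do "if not found, insert there" with no n == 0 test. */
int search_or_insertion_point(const int A[], int n, int value)
{
    int low = 0, high = n;
    while (low < high) {
        int mid = low + (high - low) / 2;
        if (A[mid] < value)
            low = mid + 1;
        else if (A[mid] > value)
            high = mid;
        else
            return mid;
    }
    return -low - 1;   /* not found; would insert at index 'low' */
}
```

On an empty list the loop is skipped and the result is -1, i.e. "insert at index 0", which is exactly the AddNewEntry path with no PlaceFirstEntry special case.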

In fact, all the wiki programming algorithm pages probably should have "Correctness and Testing" sections for the basic conditions and for the more complex conditions. --75.165.249.107 22:47, 15 July 2007 (UTC)
This is part of why programs on my LiteratePrograms wiki usually can be downloaded and compiled and usually do include testing, because of this sort of problem. Dcoetzee 21:45, 16 July 2007 (UTC)

[edit] Code example

Apologies for the previous change. I had misread the loop condition as low<=high, which does require mid-1. (This is how I usually code this algorithm.) With the condition as low<high the code will terminate because mid is always less than (low+high)/2 unless low and high are the same. Sorry about this!

ICrann15 11:35, 16 July 2007 (UTC)

Ah well, I also have made just such a mistake. NickyMcLean 21:18, 16 July 2007 (UTC)

[edit] The "equal elements" section is horribly wrong!

The "equal elements" section states that the binary search will stop immediately when it finds an item equal to the one being searched, and that if there is more than one such item in the list, it's basically random which one of them is found, and that if for example the first one of them must be returned, a linear search must be performed for the equal elements.

This is horribly wrong. Sure, you can do it like that, but it's perfectly possible and quite trivial to implement the binary search in such way that it will not stop immediately when it finds an element equal to the one being searched, but continues until it has narrowed the search to exactly one element. This one element could be, for example, the first one of the (possibly) equal elements in the list (the other possibility is that it returns the last equal element in the list). No linear search of any kind is required. The search will still be purely O(log n), but it will always find the first element if more than one element equal to the searched one exists in the list. For example the std::lower_bound() function in the C++ library works like this.

Suggesting a linear search is a horrendously bad idea. The worst-case scenario is that the algorithm will perform n/2 comparisons, making it linear time, completely destroying the speed benefit of binary search. It's perfectly possible to do the O(log n) search even if there's a very large amount of equal elements.
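A sketch in C of the lower-bound variant described above (std::lower_bound behaves this way; the C function itself is my own illustration): the loop keeps halving even after seeing an equal element, so it lands on the first one in O(log n).

```c
#include <assert.h>

/* Returns the index of the first element >= value in a sorted array,
   or length if every element is smaller. Equal runs of any size are
   handled in O(log n): the loop never stops early on a mere match. */
int lower_bound_int(const int A[], int length, int value)
{
    int low = 0, high = length;          /* half-open range [low, high) */
    while (low < high) {
        int mid = low + (high - low) / 2;
        if (A[mid] < value)
            low = mid + 1;
        else
            high = mid;                  /* A[mid] >= value: keep it in range */
    }
    return low;
}
```

Against [1, 2, 2, 2, 3], searching for 2 yields index 1, the first of the three equal elements, with no linear scan.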

85.194.211.152 13:40, 31 July 2007 (UTC)

[edit] Performance note

For in-memory searching, if the interval to be searched is small, a linear search may have superior performance simply because it exhibits better locality of reference.

Is this true? If the array is small, then it fits in one or two cache lines, and once cached, you are not going to miss it again regardless of what order you probe the elements in. I suspect the real reason that the linear search can run faster for a very short array has to do with branch prediction on pipelined architectures: if you sequentially equality-test each element, the equality test will usually be false, and the branch predictor will be wrong only once, when the item is found. But for a properly-implemented binary search, you search the left half or the right half of the remaining elements with equal probability, the branch predictor is of no help, and half the branches invalidate the pipeline. —Preceding unsigned comment added by 71.150.253.46 (talk) 12:43, 8 November 2007 (UTC)
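For concreteness, the short-array linear scan under discussion might look like this in C (a sketch; whether it actually beats a binary chop at any given array size is exactly the kind of claim that needs benchmarking on real hardware):

```c
#include <assert.h>

/* Straight scan of a sorted array: one highly predictable branch per
   element, stopping early once elements exceed the target. Plausibly
   competitive only for very short arrays. Returns index or -1. */
int linear_search_sorted(const int A[], int length, int value)
{
    for (int i = 0; i < length && A[i] <= value; i++)
        if (A[i] == value)
            return i;
    return -1;
}
```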

On paper, or when looking up a telephone book, I think I'd do a random search. —Preceding unsigned comment added by 220.226.37.155 (talk) 06:09, 6 January 2008 (UTC)


[edit] General problem with suggested primary algorithms, both recursive and iterative solutions


The description of the binary search algorithm is correct, but the shown examples are broken. The problem arises from the index range of the examples [0 to N-1] for an N-element sorted array to search, and the calculated next indices.

As can be tested with the given sorting test program (which has to be modified a little if standard C is used instead of C++; the sort routine also needs the number of elements), the start index for an N-element array search using the described algorithm has to be 1 and not 0, and the associated end index has to be N and not N-1.

This is a VERY UNOBVIOUS but ESSENTIAL difference!

Reason of this problem:

 The described and commonly used (and so far correct) algorithm is described in D.E. Knuth,
 Vol. 3, 6.2.1 [Algorithm B (Binary search)], with all element indices based at 1.
 The elements of the array to be searched are numbered from A[1]..A[N].
 It is obviously tempting to 'shift' down the index range from 1..N to 0..N-1 to ease
 access to the normally zero based search array, so to access it as A[0]..A[N-1].
 But this leads to incorrect results/malfunction due to the fact that the array
 midpoint estimation index function is defined as:
   NextIndex = floor((TopIndex + BottomIndex)/2).
 The floor function introduces a kind of 'nonlinear' behaviour. The 'correction' of this
 estimation function is done by the result of the element comparison:
   - if current element's value is the searched one, then found element and terminate.
    - if the searched value is below the current element's value, then let TopIndex = NextIndex - 1
    - else let BottomIndex = NextIndex + 1
    if BottomIndex > TopIndex, no match has been found.
 When 'shifting down' the index values range from 1..N to 0..N-1, the Midpoint estimation
 index value also shifts down, but only by an amount of 1/2. This leads to the fact, that
 there will be sometimes an access to an array element at A[-1] (expressed in indices 0..N-1)
 that may/will not exist (see discussion at D.E. Knuth).
 One possible modified solution to keep the algorithm working as desired AND to access the array elements
 starting at index 0 is:
   Initialize:
     - Set BottomIndex = 1
     - Set TopIndex    = N
   SearchLoop:
     - if (BottomIndex > TopIndex) : all elements tested, no match found, terminate (state: not found)    
     - Calculate/Compare:
        NextIndex = floor[(TopIndex + BottomIndex) / 2] - 1     // now Access index is rel. 0
        compare A[NextIndex] to given Value:
        . A[NextIndex] == Value: found match, terminate (state: found)
         . A[NextIndex]  < Value: let BottomIndex = NextIndex + 2 // Note: +1, plus undoing the -1 from the step above
         . A[NextIndex]  > Value: let TopIndex    = NextIndex     // Note: -1 already done at the step above


Gerhard Oed, HOB —Preceding unsigned comment added by 89.51.228.110 (talk) 22:36, 15 April 2008 (UTC)

[edit] Relevance of new additions

I'm discussing here the newly-added additions.

In short, I'm not sure the majority of this material really improves the article, for the following reasons:

  • The formalised notation and the "That it works" section are somewhat heavyweight, and IMO, somewhat unnecessary as a proof for what is really quite a trivial (and intuitive) mechanism.
  • The "That is fast" section is redundant, as the algorithmic complexity is already covered in the original material.
  • The "Extensions" are trivial extensions, and therefore (IMO), not really worth mentioning.
  • The "Computer usage" section is completely non-specific; issues of return values, number-range limits, and floating-point special cases apply to almost any algorithm that must be run on a computer. The issues of the three-way comparison and overflow of (high+low)/2 are already covered in the original material.

Incidentally, there was material already present that I'm not sure is really suitable for an encyclopaedia article, such as the "Testing" section. IMO, some serious editing needs to be done.

I welcome comments. Oli Filth(talk) 21:56, 3 June 2008 (UTC)

Well, the method is simple, and its working is indeed intuitively clear to anyone searching a dictionary or under/overshooting a parameter in some process (e.g. an artillery shell overshoots/undershoots the target, so change the elevation and charge appropriately), because the human intellect is active. But if you look at the stuff on the computer implementation you'll see that getting the details correct for a computer version is mistake prone and many people (including me, alas) have made mistakes, and the various pseudocode/C code implementations offered in the article are frequently wrong, corrected, mis-corrected and re-corrected, so that at all times a given version is in doubt. (Same with Quicksort, for that matter.) And here the petty details of indexing arrays are frequently confused. Thus, I was thinking of developing the formal proof by noting that test programmes to assure the correct functioning of a binary search routine are in fact unreliable, unless they test all possible parameters, and this is of course impractical in time. Thus the need for emphasis on the difference between arithmetic (as used in the proof) and computer arithmetic (as used in a programme) and the resulting attention to the introduced problems. For my part, I'd never have imagined that a "NaN" value was such that x = x would return false, and would never have imagined a need to prepare for such considerations in a test programme. The issue of formal proofs of correctness of actual computer programmes is a large subject. I've spent the last two days in bed with a sniffle, and realised that I'd omitted mention of a detail from the proof. How now to know that a proof is correct?
That it is fast is not redundant in principle, but I ran out of time and patience as it was. The actual behaviour of the method is quite interesting. I agree that the article sprawls, and a regrouping would help. The extensions are trivial to those already familiar with computer programming, but an encyclopaedic article should have encyclopaedic coverage. I'm not too happy with the use of A(i).Key rather than just A(i) as is usual in the toy examples, but wanted to have the code's extension to real usage more easily seen, though in discussions not yet developed far. NickyMcLean (talk) 21:17, 5 June 2008 (UTC)
I think, in general, formal symbolic proofs of correctness are too low-level for inclusion in Wikipedia articles (except where we are explicitly demonstrating such a formal proof). A high-level proof of correctness here is succinct and perfectly rigorous by the standards of mathematicians. Spurious corrections by people with minimal understanding of the algorithm are an eternal issue that will not be discouraged by such additions (these people typically would dismiss such a proof). Dcoetzee 22:20, 5 June 2008 (UTC)