Data Structures (Part I): Linear and Binary Searches
Lately, I have been thinking on how to understand data structures and their need on a more abstract and deeper level. I have come to the following observations.
Finding the solution of a problem has two facets: i) how is the initial data given (sorted, unsorted, multidimensional arrays, its size, rules of manipulation, etc.) and ii) what is the most efficient data structure to store and explore it (arrays, sets, hash maps, stacks, linked lists, dictionaries, trees, tries, heap, queues, graphs, etc.).
How do the different properties of the first facet affect the search? Which of the properties affect the most? And what gets majorly affected?
We start by answering the last question. The time and memory complexity of an algorithm are the two main features that can get affected by our choices. We will come back to this point later.
In general, the computer capacity is what first of all sets the limits of what one is able to do. By limits we should understand the size of the data which can be safely handled without getting into troubles. Imagine we had hardware capable of infinite calculations per seconds. We would not even notice the difference between an inefficient and efficient algorithm if their temporal difference was not meaningful.
As we do have finite computer capabilities, the field of data structures and efficient algorithms is relevant to our present needs. Even more if we are required to actually deal with big data sets.
One dimensional data
Let us consider the most simple scenario: a one dimensional array of elements. In general, the objective can be described by two actions: locating an element and doing an action with it (removing it, replacing it, inserting another element next to it, returning it, etc.).
This array of elements can have different characteristics:
- Data type (integer, float, strings, bits, ...).
- Sorted or unsorted (whenever this applies to the data type).
- With extra inputs that must be identified and eliminated before actually exploring the elements of interest.
- Elements with certain multiplicity (e.g. duplicated elements).
- Its size (from hundreds to billions or trillions of elements).
Locating an element
If the elements are unsorted. The worst case scenario would always require visiting all the array elements, O(N) where N is the length of the array.
On the other hand, for a sorted array, things are different. The slowest scenario is now given by the way we search. A linear search would still require visiting N elements.
def linear_search(target: int, nums: list()) -> bool:
for n in nums:
if n == target:
return True
return False
Other searches can reduce the amount of visited elements. One of such searches is Binary Search. It works by repeatedly dividing the search interval in half, and asking in which side could the target value be, until it is found or the search interval is empty. Binary search is a very efficient algorithm, with a time complexity of O(log2 N) where N is the size of the array. Let us compare it with a linear search. For a one billion (10**9) elements array, a binary search would only take around 27 steps!!! While a linear search would have taken 10**9!!! An extraordinary difference between the two.
def binary_search(target: int, nums: list()) -> bool:
l,r = 0, len(nums)-1
while l <= r:
mid = (r+l)//2
if nums[mid] == target:
return True
elif nums[mid] < target:
l = mid + 1
else:
r = mid - 1
return False
The above code exemplifies a binary search. We would have to modify it for large arrays to a difference of the two pointers:
mid = (r-l)//2 + (r-l)%2
With an additional increase or decrease depending on which half one should move in. An easy way to understand why the time complexity is O(log2 N) is to realize it takes log2 N operations to bring N, by repeatedly dividing it by two, to an O(1) number.