The data of a million 32-bit integers would take up 4 megabytes, assuming binary encoding. So to sort them with minimal memory usage, the sorting algorithm most of people will consider is the MergeSort algorithm. The basic idea is to create a temporary file that is stored in the hard disk. Then read the first 0.5 million integers in the RAM, sort them, and then save them into a temporary file. Afterwards, read the remaining 0.5 million integers in the RAM, sort them, and then save them into another temporary file. Finally, we merge these two temporary files together.
However, there is a better solution without using temporary files (this is proposed by others. see http://news.ycombinator.com/item?id=341037).
Select the 500.000 smallest values, sort them using an in place sorting algorithm and spit them out, then select the 500.000 biggest values, sort them and spit them out.
The problem then boils down to select the n/2 smallest values of the list. To do this we store the n/2 first values in the 2MB buffer. We then organize them into a heap with the biggest at the top. I then process the remaining n/2 values. When one of these values is bigger than the heap top value, it is skipped. If it is smaller, the heap biggest value is replaced with the smaller value and the heap is restored to get the biggest value at the top. When all the n/2 values have been processed, the 2MB buffer contains the n/2 smallest values of the million integer list.
Since it is already organized as a heap one can use heap sort to sort it. Pick the biggest value at the top and swap it with the last value of the list and restore heap to get the next biggest value at the top. Swap it with the last unsorted value and proceed until the heap length is reduced to one element.
Apply the same algorithm with the n/2 biggest value but using a heap with the smallest value at the top.
This algorithm requires two passes over the million integer, and uses nothing more than the 2MB buffer. The complexity is O(nlog).
This algorithm requires two passes over the million integer, and uses nothing more than the 2MB buffer. The complexity is O(nlog).