In order to study the performances of these two selection algorithms, I generated different size of arrays and compared the running speed of them. Small size of arrays were run for several times so that the result can be more accurate.

Performances of Two Selection Algorithm

Introduction

In order to study the performances of these two selection algorithms, I generated different size of arrays and compared the running speed of them. Small size of arrays were run for several times so that the result can be more accurate.

Input: array 𝐴 with 𝑛 distinct numbers and a number 𝑖

​  “Distinct” for simplicity

Output: 𝑖-th smallest element in the array

​  Assume index starts from 0

Example: \(𝐴 = (6,3,5,4,2),𝑖 = 2\) ​  Should return 4

Comparison of algorithms

There is no limitation of runtime for all algorithms, because all these two algorithms have similar time complexity \(O(n)\). Then I used DataGrapg to plot two graphs, one of small test cases, and another of all cases.

Loop several times

With #define LOOP_TIME 20 we can run it for 20 times so that the result can be more accurate. (Please checkperformance.cpp`)

long time_all = 0;
for (int lo = 0; lo < LOOP_TIME; lo++)
{
    int arr_copy[lines];
    //use deep copy to make arr_copy evry turn
    memset(arr_copy,0, lines*sizeof(int));
    memcpy(arr_copy,arr, lines*sizeof(int));
    start = clock();
    fn[i](arr_copy, lines);
    end = clock();
    time_all += (end - start);
}
cout<<"Sort algorithm is ["<<sortName[i]<<"],";
double time_run =  (double)time_all / CLOCKS_PER_SEC / LOOP_TIME;
cout << "Running time: " <<time_run<< endl;

test

You can run make test0 and make test1 to test the program.

  1. With // #define SELECTION_DEBUG
bogon:answer bingcheng$ make test0
g++ -std=c++11 -O3 -Wall -o generate gen_rand.cpp
g++ -std=c++11 -O3 -g -Wall -c main.cpp
g++ -std=c++11 -O3 -g -Wall -c selection.cpp
g++ -std=c++11 -O3 -g -Wall -o main main.o selection.o
./generate 0 40000 2000 > input.data
./main < input.data
The order-2000 item is -1932064068
bogon:answer bingcheng$ make test1
g++ -std=c++11 -O3 -Wall -o generate gen_rand.cpp
g++ -std=c++11 -O3 -g -Wall -o main main.o selection.o
./generate 1 40000 2000 > input.data
./main < input.data
The order-2000 item is -1932064068
  1. With #define SELECTION_DEBUG
bogon:answer bingcheng$ make test0
g++ -std=c++11 -O3 -Wall -o generate gen_rand.cpp
g++ -std=c++11 -O3 -g -Wall -c main.cpp
g++ -std=c++11 -O3 -g -Wall -o main main.o selection.o
./generate 0 40000 2000 > input.data
./main < input.data
selection algorithm is [R selection],
RunTime:0.000509
#[2000]smallest:[-1932064068], real:[-1932064068]
The order-2000 item is -1932064068
bogon:answer bingcheng$ make test1
g++ -std=c++11 -O3 -Wall -o generate gen_rand.cpp
g++ -std=c++11 -O3 -g -Wall -o main main.o selection.o
./generate 1 40000 2000 > input.data
./main < input.data
selection algorithm is [D selection],
RunTime:0.000941
#[2000]smallest:[-1932064068], real:[-1932064068]
The order-2000 item is -1932064068

You can get the real ith smallest and the selection ith smallest along with the runtime.

Big data analysis

From Figure 1, all these line looks similar, so it's meaning less. So we make log at both x and y axis. As Figure 2 shows, we can find that both Random Selection and Deterministic Selection Algorithm have the runtime \(O(n)\), and they are parallel when the size of the numbers is larger than \(10^4\).

So What happend when the size of the numbers is smaller than \(10^3\)? it seems that these two lines are connected to each other. Let's see the Small data analysis.

[Figure 1] time vs size in large period without log

As Figure 2 shows below, Rselection is not stable because the line of it is not as straight as Dselection. Moreover, Rselection is faster than Dselection.

[Figure 2] time vs size in large period

Small data analysis

From Figure 4 we can see that when the data size is small, it's so hard to recognize the runtime because the datas are so unstable. There are so many data hopping all the time, though I used Loop several times in my code, it doesn't work because the performance of my computer is not stable.

[Figure 3] time vs size in small period without log

Because these two lines are so similar, which means that we can use any of them when the number of datas are small.

[Figure 4] time vs size in small period

check memory leakage

==6463== LEAK SUMMARY:
==6463==    definitely lost: 0 bytes in 0 blocks
==6463==    indirectly lost: 0 bytes in 0 blocks
==6463==      possibly lost: 0 bytes in 0 blocks
==6463==    still reachable: 72,704 bytes in 1 blocks
==6463==         suppressed: 0 bytes in 0 blocks
==6463== Reachable blocks (those to which a pointer was found) are not shown.

No memory leakage.

Conclution

According to the data analysis, Dselect is not as good as Rselect in practice, because it has worse constants and it's not in-place.

With worse constants it will cost more time and you can find that in Figure 1. It's not in-place so it will cost more time to acquire for datas, which is inefficiency.

So at most conditions, we should use Rselect in practice.

Appendix

The project files

selection.h

#ifndef SELECTION_H
#define SELECTION_H

int random_selection(int* arr, const int n, const int order);
// Randomized selection algorithm
// MODIFIES: *arr
// EFFECTS: select i-th smallest element in the array

int deterministric_selection(int* arr, const int n, const int order);
// Deterministic selection algorithm
// MODIFIES: *arr
// EFFECTS: select i-th smallest element in the array

#endif

selection.cpp

#include <iostream>
#include <cstdlib>
#include <cassert>
#include <ctime>
#include "selection.h"

using namespace std;

static void int_append(int *arrA, const int *arrB, const int s){
  // MODIFIES: *arrA
    // EFFECTS: append first "s" int in arrB tp the biginning if arrA.
  assert(s >= 0);
  if(s == 0) return;
  for (int i = 0; i < s; ++i)
  {
    arrA[i] = arrB[i];
  }
}

static int random_pivot(int* arr, const int n){
// Choose pivot p from arr uniformly at random;
// Partition arr using pivot p;
// Let j be the index of p, return j;
  const int size = n;
  int BL = 0, BR = size-1;
  int * B = new int[size];
  int * A = arr;
  srand((unsigned)time(NULL));
  const int pivotat = rand()%size;
  const int t = A[pivotat];
  for (int i = 0; i < size; ++i)
  {
    if(i == pivotat) continue;
    if(A[i] > t) B[BR--] = A[i];
    else B[BL++] = A[i];
  }
  assert(BL == BR);
  B[BL] = t;
  int_append(A, B, size);
  delete[] B;
  return BL;
}

static void insertion_sort(int *arr, const int n){
// MODIFIES: *arr
// EFFECTS: sort integers arr[] in ascending order with insertion_sort.
  for (int i = 1; i < n; ++i)
  {
    int t = arr[i];
    int j = i;
    while (j >= 1)
        {
            if (arr[j - 1] > t)
            {
                arr[j] = arr[j - 1];
                j--;
            }
            else break;
        }
    arr[j] = t;
  }
}

static int Deterministic_pivot_helper(int* arr, int n){

  if(n == 1) return arr[0]; 
  int full_bucket = n/5;
  int arr_medians_size = full_bucket+(n%5+4)/5;
  int* arr_medians = new int [arr_medians_size];
  int incomplete_bucket = arr_medians_size - full_bucket;
  for (int i = 0; i < full_bucket; ++i)
  {
    int* arr_break_5 = arr + i*5;
    insertion_sort(arr_break_5, 5);
    arr_medians[i] = arr_break_5[2];
  }
  if (incomplete_bucket != 0)
  {
    int incomplete_bucket_size = n%5;
    int* arr_break_5 = arr + full_bucket*5;
    insertion_sort(arr_break_5, incomplete_bucket_size);
    arr_medians[full_bucket] = arr_break_5[incomplete_bucket_size/2];
  }
  int pivot = Deterministic_pivot_helper(arr_medians, arr_medians_size);
  return pivot;
}

static int partition_array(int *arr, const int n, const int pivot){
  // MODIFIES: *arr
  // EFFECTS: choose a pivotat then Move pivot to its correct place in the array.
  const int size = n;
  int BL = 0, BR = size-1;
  int * B = new int[size];
  int * A = arr;
  const int t = pivot;
  for (int i = 0; i < size; ++i)
  {
    // cerr<<arr[i]<<", ";
    if(A[i] == t) continue;
    if(A[i] > t) B[BR--] = A[i];
    else B[BL++] = A[i];
  }
  // cerr<<endl;
  // cerr<<"pivot = "<<pivot<<", size = "<<size<<", BL = "<<BL<<", BR = "<<BR<<endl;
  assert(BL <= BR);
  
  for (int i = BL; i <= BR; ++i)
  {
    B[i] = t;
  }
  int_append(A, B, size);
  delete[] B;
  return BL;
}

static int Deterministic_pivot(int* arr, int n){
// Choose pivot p from arr uniformly at deterministic;
// Partition arr using pivot p;
// Let j be the index of p, return j;
  int pivot = Deterministic_pivot_helper(arr, n);
  int j = partition_array (arr, n, pivot);
  return j;
}

int random_selection(int* arr, const int n, const int order){
  if(n == 1) return arr[0];
  int j = random_pivot(arr, n);
  if(j == order) return arr[order];
  if(j > order) {
    int* arr_left = arr;
    int length = j;
    return random_selection(arr_left, length, order);
  }
  else{
    int* arr_right = arr + j + 1;
    int length = n - j - 1;
    return random_selection(arr_right, length, order-j-1);
  }
}

int deterministric_selection(int* arr, const int n, const int order){
  if(n == 1) return arr[0];
  int j = Deterministic_pivot(arr, n);
  if(j == order) return arr[order];
  if(j > order) {
    int* arr_left = arr;
    int length = j;
    return deterministric_selection(arr_left, length, order);
  }
  else{
    int* arr_right = arr + j + 1;
    int length = n - j - 1;
    return deterministric_selection(arr_right, length, order-j-1);
  }
}

main.cpp

#include <iostream>
#include <cstdlib>
#include <assert.h>
#include <ctime>
#include "selection.h"

#define SELECTION_WAY_SIZE 2
// #define SELECTION_DEBUG

using namespace std;

const string selectionName[] = {
  "R selection","D selection","ERROR_SELECTION_Name"
};

#ifdef SELECTION_DEBUG
void insertion_sort(int *arr, const int n){
  for (int i = 1; i < n; ++i)
  {
    int t = arr[i];
    int j = i;
    while (j >= 1)
        {
            if (arr[j - 1] > t)
            {
                arr[j] = arr[j - 1];
                j--;
            }
            else break;
        }
    arr[j] = t;
  }
}
#endif


int main(int argc, char *argv[]) {
  int (*const fn[SELECTION_WAY_SIZE])(int*, const int, const int) = {
        random_selection,
        deterministric_selection
    };
    int selection_algorithm;
  cin >> selection_algorithm;
  assert(selection_algorithm >= 0 && selection_algorithm < SELECTION_WAY_SIZE);
  #ifdef SELECTION_DEBUG
  cout<<"selection algorithm is ["<<selectionName[selection_algorithm]<<"],"<<endl;
  #endif

  int n;
  cin >> n;
  int order;
  cin >> order;
  int *arr = new int[n];
  for (int i = 0; i < n; ++i)
  {
    cin>>arr[i];
  }
  int i_th_small;
  clock_t start, end;
  start = clock();
  i_th_small = fn[selection_algorithm](arr, n, order);
  end = clock();

  #ifdef SELECTION_DEBUG
  int arr_copy[n];
  //use deep copy to make arr_copy evry turn
  memset(arr_copy,0, n*sizeof(int));
  memcpy(arr_copy,arr, n*sizeof(int));
  insertion_sort(arr_copy, n);
  cout << "RunTime:" << (double)(end - start) / CLOCKS_PER_SEC << endl;
  cout << "#["<<order<<"]smallest:["<<i_th_small<<"], real:["<<arr_copy[order]<<"]"<<endl;
  #endif
  cout<<"The order-"<<order<<" item is "<<i_th_small<<endl;
  delete[] arr;
  return 0;
}

Makefile

all: main.o selection.o
  g++ -std=c++11 -O3 -g -Wall -o main main.o selection.o

auto: make_auto
  ./autogen

make_auto: auto_gen.o
  g++ -std=c++11 -O3 -g -Wall -o autogen auto_gen.o

auto_gen.o: auto_gen.cpp
  g++ -std=c++11 -O3 -g -Wall -c auto_gen.cpp

t: make_test
  ./test

make_test: simple_test.o selection.o
  g++ -std=c++11 -O3 -g -Wall -o test simple_test.o selection.o

simple_test.o: simple_test.cpp
  g++ -std=c++11 -O3 -g -Wall -c simple_test.cpp

test0: gen all
  ./generate 0 40000 2000 > input.data
  ./main < input.data

test1: gen all
  ./generate 1 40000 2000 > input.data
  ./main < input.data

gen:
  g++ -std=c++11 -O3 -Wall -o generate gen_rand.cpp

p: perf
  ./perform 

perf: performance.o selection.o
  g++ -std=c++11 -O3 -g -Wall -o perform performance.o selection.o

performance.o: performance.cpp
  g++ -std=c++11 -O3 -g -Wall -c performance.cpp

main.o: main.cpp
  g++ -std=c++11 -O3 -g -Wall -c main.cpp

selection.o: selection.cpp
  g++ -std=c++11 -O3 -g -Wall -c selection.cpp
  
v:
  valgrind --leak-check=full ./main < input.data

tar:
  tar czvf p2.tar main.cpp selection.cpp selection.h p2.pdf

clean:
  rm -f ./main *.o *.data test generate autogen perform

auto_gen.cpp

#include <iostream>
#include <stdlib.h>
#include <sstream>
#include <assert.h>
#include <fstream>
#include <math.h>
using namespace std;
int main(int argc, char *argv[]) {
  ofstream oFile;

  for (int i = 0; i < 100; ++i)
  {
    ostringstream path_stream;
    path_stream<<"./data_file/"<<i<<".data";
    oFile.open(path_stream.str());
    double k = 1.0 + 2.0/100*(i);
    int ek = pow(10, k);
    cerr<<ek<<", ";

    oFile<<0<<endl;
    oFile<<ek<<endl;
    oFile<<0<<endl;
    for (int i = 0; i < ek; ++i)
    {
      int k = mrand48();
      oFile << k <<endl;
    }
    oFile.close();
  }
}

Generated files look like this.

files

performance.cpp

This program will produce a CSV table as 3.1.7 shows.

#include <iostream>
#include <iomanip>
#include <sstream>
#include <cstdlib>
#include <assert.h>
#include <ctime>
#include <fstream>
#include "selection.h"

using namespace std;

#define SELECTION_WAY_SIZE 2
#define file_num 100
#define LOOP_TIME 40
// #define SORT_DEBUG

void debug_print(char TAG, string deb_string){
  if(TAG == 'v') cerr<< deb_string;
}

void   Delay(int   time){ 
  clock_t   now   =   clock(); 
  while(   clock()   -   now   <   time   ); 
} 

const string selectionName[] = {
  "R selection","D selection","ERROR_SELECTION_Name"
};

bool safe_open_file(ifstream& i_file, string file_name){
  ostringstream debug_stream;
  i_file.open(file_name.c_str());
  if (i_file.fail()) {
    cout<<"Error: Cannot open file "<< file_name<<"!"<<endl; 
    exit(0);
  }
  debug_stream<<"file opened success!"<<endl;
  debug_print('v', debug_stream.str());
  debug_stream.clear();
  return true;
}

static int int_size[] = {
  10, 10, 10, 11, 12, 12, 13, 13, 14, 15, 15, 16, 17, 18, 19, 19, 20, 21, 22, 23, 25, 26, 27, 28, 30, 31, 33, 34, 36, 38, 39, 41, 43, 45, 47, 50, 52, 54, 57, 60, 63, 66, 69, 72, 75, 79, 83, 87, 91, 95, 100, 104, 109, 114, 120, 125, 131, 138, 144, 151, 158, 165, 173, 181, 190, 199, 208, 218, 229, 239, 251, 263, 275, 288, 301, 316, 331, 346, 363, 380, 398, 416, 436, 457, 478, 501, 524, 549, 575, 602, 630, 660, 691, 724, 758, 794, 831, 870, 912, 954 };

bool jump_j[] = {false, false,false,false,false,false};

int main(int argc, char *argv[]) {
  int (*const fn[SELECTION_WAY_SIZE])(int*, const int, const int) = {
        random_selection,
        deterministric_selection
    };

  clock_t start, end;
  ofstream outFile;
  outFile.open("data.csv", ios::out);
  outFile << "size"<<","<<selectionName[0]<<","<<selectionName[1]<<","<<endl;
  for (int j = 0; j < file_num; ++j)
  {
    ifstream iFile;
    
    
    ostringstream path_stream;
    path_stream<<"./data_file/"<<j<<".data";
    safe_open_file(iFile, path_stream.str());
    int lines;
    iFile >> lines;
    iFile >> lines;
    int meanning_less;
    iFile >> meanning_less;

    
    int *arr = new int[lines];
    int baz;
    for (int i = 0; i < lines; ++i)
     {
     	iFile >> baz;
     	arr[i] = baz;
     } 
    outFile << int_size[j]<<",";
    for (int i = 0; i < SELECTION_WAY_SIZE; ++i)
    {
      if(jump_j[i] == true){
        outFile << ""<<",";
        cerr<<"jump "<<selectionName[i]<<" with "<<int_size[j]<<" size!"<<endl;
        continue;
      }
      if(int_size[j] < 1000){
        cerr<<"delay at "<<int_size[j]<<" size"<<endl;
       	Delay(1000);
      } else Delay(500);


      // int arr_copy[lines];
      // //use deep copy to make arr_copy evry turn
      // memset(arr_copy,0, lines*sizeof(int));
      // memcpy(arr_copy,arr, lines*sizeof(int));
      // start = clock();
      // fn[i](arr_copy, lines, 0);
      // end = clock();

      long time_all = 0;
            for (int lo = 0; lo < LOOP_TIME; lo++)
            {
                int arr_copy[lines];
                //use deep copy to make arr_copy evry turn
                memset(arr_copy,0, lines*sizeof(int));
                memcpy(arr_copy,arr, lines*sizeof(int));
                start = clock();
                fn[i](arr_copy, lines, 0);
                end = clock();
                time_all += (end - start);
            }


      cout<<"Sort algorithm is ["<<selectionName[i]<<"],";
      double time_run =  (double)time_all / CLOCKS_PER_SEC / LOOP_TIME;
      cout << "Running time: " <<time_run<< endl;
      if (time_run >= 24.0)
      {
        jump_j[i] = true;
      }
      outFile << time_run<<",";
    }
    outFile <<endl;


    iFile.close();
    delete[] arr;
  }
  outFile.close();
  return 0;
}

scv Table

You can get full CSV table at HERE.

This table was generated by MacBook Pro 2015, i5, 256G.

size R selection D selection
100 4.2e-06 8.4e-06
112 2.8e-06 6.4e-06
125 2.8e-06 1.66e-05
141 3.8e-06 6.6e-06
158 2.6e-06 6.6e-06
177 4.2e-06 6.6e-06
199 3.6e-06 8e-06
223 4e-06 8e-06
251 6e-06 9e-06
281 7e-06 1.38e-05
... ... ...
1778279 0.0146598 0.0363128
... ... ...
7943282 0.0283762 0.157991

DataGraph

I used DataGraph to generate images on MacBook Pro.

image-20181011173601662