快速排序的基本思想,【外部排序】快排思想完成外部排序

 2023-12-25 阅读 46 评论 0

摘要:問題描述 ? 利用快排思想,實現外排。 算法思想 將磁盤劃分為四個區域:middle, small, large, input當input為空時,從磁盤中讀取數據將游標存放在middle中,middle采用雙端堆進行存儲,可以以log?(n)\log(n)log(n)的效率獲取最大最小值。 在本

問題描述

? 利用快排思想,實現外排。

算法思想

在這里插入圖片描述

  1. 將磁盤劃分為四個區域:middle, small, large, input
  2. 當input為空時,從磁盤中讀取數據
  3. 將游標存放在middle中,middle采用雙端堆進行存儲,可以以log?(n)\log(n)log(n)的效率獲取最大最小值。 在本實驗中,因為數據可能重復的緣故,所以使用了stl中自帶的優先隊列進行實現,而非使用更高效的平衡二叉樹實現
  4. 先從磁盤中讀取數據,填充Middle
  5. 不斷從input中讀取數據,與游標進行比較,較大值放入Large,較小值放入Small;當輸出緩存滿時寫入磁盤兩側
  6. 遞歸磁盤較大組合較小組

算法實現

緩存類

struct Cache
{int *cache;int capacity;// the start position of the cache in disk// when the cache is empty, startPosition = endPosition = 0int diskStartPosition;int diskEndPosition;// when Cache is not filled, endPosition - startPosition != capacityint startPosition;int endPosition;int size;
};
  1. 聲明緩存類管理緩存,同時保存部分相關數據;但因為數組位置信息保存在數組類當中,所以不提供讀寫操作

Middle Cache的實現

雙端堆的實現

template <class T>
class DoubleEndPriorityQueue
{friend class DoubleEndPriorityQueueCache;//升序隊列priority_queue<int, vector<int>, greater<int>> *acc;//降序隊列priority_queue<int, vector<int>, less<int>> *desc;public:int size(){return acc->size();}bool isEmpty(){return desc->empty();}void insert(int x){acc->push(x);desc->push(x);}int getMin(){return acc->top();}int getMax(){return desc->top();}void deleteMin(){int min = acc->top();bool flag = false;acc->pop();priority_queue<int, vector<int>, less<int>> *temp = new priority_queue<int, vector<int>, less<int>>();while (!desc->empty()){if (desc->top() != min){temp->push(desc->top());desc->pop();}else{if (!flag){flag = true;desc->pop();}else{temp->push(desc->top());desc->pop();}}}delete desc;desc = temp;}void deleteMax(){int i = desc->top();bool flag = false;desc->pop();priority_queue<int, vector<int>, greater<int>> *temp = new priority_queue<int, vector<int>, greater<int>>();while (!acc->empty()){if (acc->top() != i){temp->push(acc->top());acc->pop();}else{if (!flag){flag = true;acc->pop();}else{temp->push(acc->top());acc->pop();}}}delete acc;acc = temp;}DoubleEndPriorityQueue(){acc = new priority_queue<int, vector<int>, greater<int>>();desc = new priority_queue<int, vector<int>, less<int>>();}DoubleEndPriorityQueue(const DoubleEndPriorityQueue<T> &right){acc = right.acc;desc = right.desc;}
};
  1. 為了節約時間,并沒有自己手動實現一個雙端堆
  2. 因為系統包括網上大多數紅黑樹及AVL樹的實現都不允許數據重復,所以實驗中使用了stl中的堆進行組合

數組類的實現

class ExternalArray
{friend class ExternalArrayTest;private:int length = 0;Cache inputCache;Cache smallCache;Cache largeCache;// the cache range is [startPosition, endPosition) and start from 0//  when smallCache is swapped into disk, the replaced data in disk is store in here//  and when tempCache is not dealt with, the input will input from here firstCache tempCache;DoubleEndPriorityQueueCache middleCache;fstream fio;string filePath;size_t diskIOCount = 0;
};
  1. 數組類包含對應的各個緩存
  2. 使用fstream進行文件操作
  3. 使用diskIOCount計算磁盤IO數量

算法思路

快速排序的基本思想?在這里插入圖片描述

  1. 將分段最左邊讀入游標緩存:將分段最左側讀入游標緩存,若分段小于游標緩存,則直接寫回。否則進行下一步。
  2. 創建游標(cur),遍歷分段:游標不斷右移,進行分片,直到游標到達end
  3. Small cache 寫滿:游標不斷右移,所以不會出現覆蓋的情況,若寫滿則直接寫回
  4. Large cache寫滿:
    1. 寫回
    2. 看情況保存被覆蓋的磁盤數據(當游標到寫回位置的距離小于large cache 的容量)移動end位置
  5. 完成遍歷后:
    1. 將temp cache 中的所有數據處理
    2. 將三個cache中的所有數據寫回
    3. 遞歸余下兩個分段

緩存操作函數

寫緩存:

    // write to diskvoid writeCache(Cache &cache, int diskStartPosition, int startPosition, int endPosition, bool sequencial = true){cache.diskStartPosition = diskStartPosition;fio.seekp(diskStartPosition * 13, ios::beg);for (int i = startPosition; i < cache.size; i++){fio.write(formateInt(cache[i]).append(" ").c_str(), 13);}cache.diskEndPosition = diskStartPosition + cache.size;fio.seekp(-1, ios::end);fio << NULL;fio.clear();if (!sequencial){cache.diskEndPosition = -1;cache.diskStartPosition = -1;}diskIOCount++;}void writeCache(DoubleEndPriorityQueueCache &cache, int diskStartPosition){cache.diskStartPosition = diskStartPosition;int count = 0;fio.seekp(diskStartPosition * 13, ios::beg);while (!cache.empty()){fio.write(formateInt(cache.popMin()).append(" ").c_str(), 13);count++;}cache.diskEndPosition = diskStartPosition + count;fio.seekp(-1, ios::end);fio << NULL;fio.clear();diskIOCount++;}

讀緩存:

    // basic read method// read from disk// if the disk is empty or shorter than cache, then fill with zero// the append parameter is used to determine whether the cache is appended to the old data// when read from disk, the status flag(those position) is automatically set// return count number of dataint readCache(Cache &cache, int start, int end, int diskStartPosition, bool append = false, bool sequencial = true){fio.seekg(diskStartPosition * 13, ios::beg);int i = start, count = 0;if (!append)cache.size = 0;for (; i < end && fio.peek() != EOF; i++, count++){fio >> cache[i];cache.size++;}if (append){// cache.cacheStartPosition is equal to old onecache.endPosition += count;// cache.diskStartPosition is equal to old onecache.diskEndPosition += count;}else{cache.startPosition = start;cache.endPosition = start + count;cache.diskStartPosition = diskStartPosition;cache.diskEndPosition = diskStartPosition + count;}for (; i < end; i++)cache[i] = 0;if (!sequencial){cache.diskEndPosition = -1;cache.diskStartPosition = -1;}fio.clear();diskIOCount++;return count;}void readCache(DoubleEndPriorityQueueCache &cache, int diskStartPosition, int length){cache.diskStartPosition = diskStartPosition;fio.seekg(diskStartPosition * 13, ios::beg);int t, i;for (i = 0; i < length && fio.peek() != EOF; i++){fio >> t;cache.insert(t);}cache.diskEndPosition = diskStartPosition + i;fio.clear();diskIOCount++;}

按完成率怎么排序?當tempCache非空時,優先讀取tempCache中的數據:

    // start from pos, and fill input cache//  is the tempCache is not empty(or valid), then the input will be read from tempCache firstvoid readInputCache(int pos, int length){if (length > inputCache.capacity){cout << "invalid parameter" << endl;exit(0);}if (tempCache.size == 0)readCache(inputCache, 0, length, pos, false);else{int i = 0;inputCache.size = 0;inputCache.startPosition = 0;inputCache.endPosition = 0;for (; i < length && tempCache.size > 0; i++, tempCache.size--){inputCache[i] = tempCache[i];tempCache.endPosition--;inputCache.size++;inputCache.endPosition++;}if (inputCache.size < length)readCache(inputCache, i, length, pos, true, false);inputCache.diskStartPosition = -1;inputCache.diskEndPosition = -1;tempCache.startPosition = -1;tempCache.endPosition = -1;}}void readTempCache(int pos, int length){readCache(tempCache, tempCache.endPosition, tempCache.endPosition + length, pos, true, false);}

封裝后的操作:

    // add a new element to the end of the output cache, if output cache then write to disk// if head == true, then write from head, else write from tail// in fact head == true means the element is added from the small cache// if head == false, then the element is added from the large cachebool add(Cache &cache, int x, bool head, int &diskStartPosition, int &currentDiskEndPosition){bool willMiss = false;if (cache.size < cache.capacity){cache[cache.endPosition] = x;cache.endPosition++;cache.size++;}else{if (head){writeCache(cache, diskStartPosition, 0, cache.capacity, false);diskStartPosition += cache.capacity;}else{if (diskStartPosition - cache.capacity > currentDiskEndPosition){readTempCache(diskStartPosition - cache.capacity, cache.capacity);writeCache(cache, diskStartPosition - cache.capacity, 0, cache.capacity, false);diskStartPosition -= cache.capacity;willMiss = true;}else{readTempCache(currentDiskEndPosition + 1, diskStartPosition - currentDiskEndPosition - 1);writeCache(cache, diskStartPosition - cache.capacity, 0, cache.capacity, false);diskStartPosition -= cache.capacity;}}cache.startPosition = 0;cache.endPosition = 1;cache[0] = x;cache.size = 1;}return willMiss;}

快排實現

  1. 在有了上述代理之后,便可以利用普通快排的思想進行排序
  2. 其中注意排序后要清空緩存
    // the array is start from start and end to end in diskvoid quickSort(int start, int end){// point to current position in diskint cur = start;// initializationreadCache(middleCache, start, min(middleCache.capacity, end - start));cur += middleCache.size();// point to the location in disk where the small cache should be writtenint small = start;// point to the location in disk where the large cache should be writtenint large = end;int willMiss = 0;//開始處理while (cur < large || willMiss > 0){if (willMiss > 0){cur -= min(inputCache.size, tempCache.size);willMiss--;}readInputCache(cur, min(large - cur, inputCache.capacity));for (int i = inputCache.startPosition; i < inputCache.endPosition; i++){int record = inputCache[i];if (record <= middleCache.getMin())add(smallCache, record, true, small, cur);else if (record >= middleCache.getMax()){if (add(largeCache, record, false, large, cur))willMiss++;}else{add(smallCache, middleCache.popMin(), true, small, cur);middleCache.insert(record);}cur++;if (cur >= large)break;}}for (int i = 0; tempCache.size > 0; i++, tempCache.size--){int record = tempCache[i];if (record <= middleCache.getMin())add(smallCache, record, true, small, cur);else if (record >= middleCache.getMax())add(largeCache, record, false, large, cur);else{add(smallCache, middleCache.popMin(), true, small, cur);middleCache.insert(record);}cur++;tempCache.endPosition--;}writeCache(smallCache, small, 0, smallCache.size, false);writeCache(middleCache, small + smallCache.size);writeCache(largeCache, large - largeCache.size, 0, largeCache.size, false);smallCache.size = 0;smallCache.startPosition = 0;smallCache.endPosition = 0;largeCache.size = 0;largeCache.startPosition = 0;largeCache.endPosition = 0;int a = middleCache.diskStartPosition;if (a - start > 1)quickSort(start, a);int b = middleCache.diskEndPosition;if (end - b > 1)quickSort(b, end);}

實驗結果與分析

測試類

class ExternalArrayTest
{
public:// test double end priority queuevoid test0(){srand(time(NULL));ExternalArray a(10, 10, 10, 2000);for (int i = 0; i < 2000; i++){a.middleCache.insert(rand() % 10000);}a.writeCache(a.middleCache, 0);cout << "isAccendent = " << a.isAccendent() << endl;}// test constructorvoid test1(){ExternalArray a(10);a.addRandomNumber(1000);}// test readCache()void test2(){ExternalArray a(10);a.addRandomNumber(1000);a.readInputCache(0, a.inputCache.capacity);for (int i = 0; i < 10; i++){cout << a.inputCache[i] << " ";}a.readInputCache(990, a.inputCache.capacity);cout << endl;for (int i = 0; i < 10; i++){cout << a.inputCache[i] << " ";}}void test3(){ExternalArray a(10, 10, 10, 20);a.addRandomNumber(5);a.readCache(a.middleCache, 0, 10);a.middleCache.print();a.readCache(a.inputCache, 0, 10, 0, false);a.inputCache.print();a.addRandomNumber(95);a.readCache(a.middleCache, 0, 10);a.middleCache.print();a.readCache(a.inputCache, 0, 10, 0, false);a.inputCache.print();}// test writeCache()void test4(){ExternalArray a(10, 10, 10, 20);a.addRandomNumber(100);a.middleCache.insert(0);a.middleCache.insert(1);a.middleCache.insert(2);a.writeCache(a.middleCache, 0);a.inputCache[0] = 9;a.inputCache[1] = 99;a.inputCache[2] = 999;a.writeCache(a.inputCache, 3, 0, 3);}void test5(){ExternalArray a(10, 10, 10, 20);a.addRandomNumber(100);a.readCache(a.middleCache, 0, 10);a.middleCache.print();a.readCache(a.inputCache, 0, 10, 0, false);a.inputCache.print();}//足夠大的內存int test6(){ExternalArray a(10, 10, 10, 10000);a.addRandomNumber(10000);a.quickSort();return a.isAccendent();}int test7(){ExternalArray a(10, 10, 10, 80);a.addRandomNumber(100);a.quickSort();return a.isAccendent();}//恰好夠大的內存int test8(){ExternalArray a(10, 10, 10, 80);a.addRandomNumber(100);a.quickSort();return a.isAccendent();}//不夠大的內存int test9(){ExternalArray a(100, 100, 100, 300);a.addRandomNumber(1000);a.quickSort();return a.isAccendent();}//大量數據int test10(){ExternalArray a(2500, 2500, 2500, 30000);a.addRandomNumber(100000);a.quickSort();return a.isAccendent();}//非常大量數據int test11(){ExternalArray a(2500, 2500, 2500, 30000);a.addRandomNumber(1000000);a.quickSort();return a.isAccendent();}void runTest(){if (test6() == -1)cout << "test6 pass" << endl;if (test7() == -1)cout << "test7 pass" << endl;if (test8() == -1)cout << "test8 pass" << endl;if (test9() == -1)cout << "test9 pass" << endl;if (test10() == -1)cout << "test10 pass" << endl;}void experience_diskIO(){for (int length = 1000; length < 3000; length += 500){for (int inputCacheSize = 100; inputCacheSize < 400; inputCacheSize += 100){for (int middleCacheSize = 300; middleCacheSize < 600; middleCacheSize += 100){for (int largeCacheSize = 500; largeCacheSize < 1000; largeCacheSize += 100){ExternalArray a(inputCacheSize, inputCacheSize, inputCacheSize, middleCacheSize);a.addRandomNumber(length);a.quickSort();if (a.isAccendent() == -1){cout << "length = " << length << " input and output cache size = " << inputCacheSize << "; middleCacheSize = " << middleCacheSize << " diskIOCount = " << a.diskIOCount << endl;}}}}}}void experience_time(){for (int length = 1000; length < 3000; length += 500){for (int inputCacheSize = 100; inputCacheSize < 400; inputCacheSize += 100){for (int middleCacheSize = 300; middleCacheSize < 600; middleCacheSize += 100){for (int largeCacheSize = 500; largeCacheSize < 1000; largeCacheSize += 100){ExternalArray a(inputCacheSize, inputCacheSize, inputCacheSize, middleCacheSize);a.addRandomNumber(length);clock_t start = clock();a.quickSort();clock_t end = clock();cout << "length = " << length << " input and output cache size = " << inputCacheSize << "; middleCacheSize = " << middleCacheSize << " time = " << (double)(end - start) / CLOCKS_PER_SEC << endl;}}}}}
};

實驗結果

正確性分析

快速排序在所有排序方法中速度最快,在這里插入圖片描述

實驗結果

在這里插入圖片描述

實驗分析

? 在本次實驗中,為了盡量地分析緩存大小對排序效率的影響,所有的緩存都可以進行獨立配置。其大小如下

  • inputCache: inputCacheCapacity
  • smallCache: smallCacheCapacity
  • largeCache: largeCacheCapacity
  • tempCache: smallCacheCapacity + largeCacheCapacity

? 以上緩存是直接使用的內存空間,同時還有遞歸帶來的棧空間消耗O(length)

? 所以總的內存消耗為:inputCacheCapacity+2(smallCacheCapacity+largeCacheCapacity) + O(length)

? 而運行時間瓶頸不再CPU上,而在磁盤io上,因此通過計算磁盤io的訪問次數來大致預測時間復雜度。磁盤io次數分析如下

  • 當middleCache可以容納所有的數據時:1
  • 當所有輸出cache和middleCache的總和能容納所有數據時:3
  • 當cache無法容納所有數據():

平均情況:共發生了log(n)次遞歸,每段遞歸段的大小為length/log(n)

對于長度為length的數組字段:在平均情況下,數組大小成均勻分布,寫入smallCache和largeCache的概率相等,而每一次遞歸固定讀取一次middleCache。

  • 此時若outputCache大小若大于inputCache,當outputCache未填滿時,inputCache從磁盤中讀取數據,次數為2×inputCacheCapacitysmallCacheCapacity+largeCacheCapacity\frac{2 \times inputCacheCapacity}{smallCacheCapacity+largeCacheCapacity}smallCacheCapacity+largeCacheCapacity2×inputCacheCapacity?,接下來,平均下來每次outputCache寫入時都會將文件臨時存到內存,此時無需inputCache從磁盤中讀取文件,訪問總次數為:2length?(smallCacheCapacity+largeCacheCapacity)?middleCacheCapacitysmallCacheCapacity+largeCacheCapacity2\frac{length-(smallCacheCapacity+largeCacheCapacity)-middleCacheCapacity}{smallCacheCapacity+largeCacheCapacity}2smallCacheCapacity+largeCacheCapacitylength?(smallCacheCapacity+largeCacheCapacity)?middleCacheCapacity?所以總次數為:

2×inputCacheCapacitysmallCacheCapacity+largeCacheCapacity+2length?(smallCacheCapacity+largeCacheCapacity)?middleCacheCapacitysmallCacheCapacity+largeCacheCapacity\frac{2 \times inputCacheCapacity}{smallCacheCapacity+largeCacheCapacity}\\ +2\frac{length-(smallCacheCapacity+largeCacheCapacity)-middleCacheCapacity}{smallCacheCapacity+largeCacheCapacity} smallCacheCapacity+largeCacheCapacity2×inputCacheCapacity?+2smallCacheCapacity+largeCacheCapacitylength?(smallCacheCapacity+largeCacheCapacity)?middleCacheCapacity?

  • 若outputCache小于inputCache,盡管outputCache在寫入的時候會將磁盤中同位置的文件讀到緩存,但是大小并不足以inputCache讀取,inputCache依然要讀取緩存,此時磁盤io數為:
    2length?middleCacheCapacitysmallCacheCapacity+largeCacheCapacity+length?middleCacheCapacityinputCacheCapacity2\frac{length-middleCacheCapacity}{smallCacheCapacity+largeCacheCapacity}\\ +\frac{length-middleCacheCapacity}{inputCacheCapacity} 2smallCacheCapacity+largeCacheCapacitylength?middleCacheCapacity?+inputCacheCapacitylength?middleCacheCapacity?
    此時發生發生的總磁盤iO數為
    (2×inputCacheCapacitysmallCacheCapacity+largeCacheCapacity+2lengthlog?(length)?(smallCacheCapacity+largeCacheCapacity)?middleCacheCapacitysmallCacheCapacity+largeCacheCapacity)×log?(length)(\frac{2 \times inputCacheCapacity}{smallCacheCapacity+largeCacheCapacity}\\ +2\frac{\frac{length}{\log(length)}-(smallCacheCapacity+largeCacheCapacity)-middleCacheCapacity}{smallCacheCapacity+largeCacheCapacity})\\ \times \log(length) (smallCacheCapacity+largeCacheCapacity2×inputCacheCapacity?+2smallCacheCapacity+largeCacheCapacitylog(length)length??(smallCacheCapacity+largeCacheCapacity)?middleCacheCapacity?)×log(length)

    (2lengthlog?(length)?middleCacheCapacitysmallCacheCapacity+largeCacheCapacity+lengthlog?(length)?middleCacheCapacityinputCacheCapacity)×log?(length)(2\frac{\frac{length}{\log(length)}-middleCacheCapacity}{smallCacheCapacity+largeCacheCapacity}\\ +\frac{\frac{length}{\log(length)}-middleCacheCapacity}{inputCacheCapacity})\\ \times \log(length) (2smallCacheCapacity+largeCacheCapacitylog(length)length??middleCacheCapacity?+inputCacheCapacitylog(length)length??middleCacheCapacity?)×log(length)

    而對于最壞情況 ,每一次遞歸都將所有的數據進行一次移動,此時發生lenght次遞歸,每次遞歸長度為length

    而磁盤io次數分析同上 (盡管是這樣分析,但實際情況在這種極端情況下,很有可能總是往一個緩存中寫入),公式如下:

(2×inputCacheCapacitysmallCacheCapacity+largeCacheCapacity+2length?(smallCacheCapacity+largeCacheCapacity)?middleCacheCapacitysmallCacheCapacity+largeCacheCapacity)×length(\frac{2 \times inputCacheCapacity}{smallCacheCapacity+largeCacheCapacity}\\ +2\frac{length-(smallCacheCapacity+largeCacheCapacity)-middleCacheCapacity}{smallCacheCapacity+largeCacheCapacity})\\ \times length (smallCacheCapacity+largeCacheCapacity2×inputCacheCapacity?+2smallCacheCapacity+largeCacheCapacitylength?(smallCacheCapacity+largeCacheCapacity)?middleCacheCapacity?)×length


(2length?middleCacheCapacitysmallCacheCapacity+largeCacheCapacity+length?middleCacheCapacityinputCacheCapacity)×length(2\frac{length-middleCacheCapacity}{smallCacheCapacity+largeCacheCapacity}+\frac{length-middleCacheCapacity}{inputCacheCapacity})\times length (2smallCacheCapacity+largeCacheCapacitylength?middleCacheCapacity?+inputCacheCapacitylength?middleCacheCapacity?)×length

版权声明:本站所有资料均为网友推荐收集整理而来,仅供学习和研究交流使用。

原文链接:https://808629.com/196542.html

发表评论:

本站为非赢利网站,部分文章来源或改编自互联网及其他公众平台,主要目的在于分享信息,版权归原作者所有,内容仅供读者参考,如有侵权请联系我们删除!

Copyright © 2022 86后生记录生活 Inc. 保留所有权利。

底部版权信息