what is External Sort? Explain with Example
External sort is a sorting technique used to sort large data sets that cannot fit in the main memory of a computer. It is used when the data is too large to fit in the memory and has to be stored on disk. External sort works by reading small chunks of the data at a time, sorting them, and then writing them back to disk. This process is repeated until all of the data is sorted.
For example, let's say we have a large data set of 1 terabyte that needs to be sorted. Instead of loading all of the data into memory at once, we can break it down into smaller chunks of, for example, 100 megabytes each. We can then read each chunk, sort it, and write it back to disk. Once all of the chunks are sorted, we can then merge them back together to form the final sorted data set.
In terms of code, external sort can be implemented using a combination of a sorting algorithm (such as quicksort or merge sort) and file input/output operations. Here is an example of how external sort can be implemented in Python:
#include
#include
#include
// Function to sort a chunk of data
void sort_chunk(int* chunk, int chunk_size) {
qsort(chunk, chunk_size, sizeof(int), cmpfunc);
}
// Function to merge sorted chunks of data
void merge_chunks(char* chunks[], int chunk_count, char* final_file) {
FILE* final = fopen(final_file, "w");
for (int i = 0; i < chunk_count; i++) {
FILE* current = fopen(chunks[i], "r");
int x;
while (fscanf(current, "%d", &x) != EOF) {
fprintf(final, "%d\n", x);
}
fclose(current);
}
fclose(final);
}
// External sort function
void external_sort(char* data_file, int chunk_size) {
// Open data file
FILE* data = fopen(data_file, "r");
// Create temporary chunks
char chunks[10][30];
int chunk_count = 0;
int* chunk = (int*)malloc(chunk_size * sizeof(int));
int chunk_index = 0;
int x;
while (fscanf(data, "%d", &x) != EOF) {
chunk[chunk_index++] = x;
if (chunk_index == chunk_size) {
sort_chunk(chunk, chunk_size);
sprintf(chunks[chunk_count], "chunk%d.txt", chunk_count);
FILE* current = fopen(chunks[chunk_count], "w");
for (int i = 0; i < chunk_size; i++) {
fprintf(current, "%d\n", chunk[i]);
}
fclose(current);
chunk_count++;
chunk_index = 0;
}
}
// Sort and write last chunk
sort_chunk(chunk, chunk_index);
sprintf(chunks[chunk_count], "chunk%d.txt", chunk_count);
FILE* current = fopen(chunks[chunk_count], "w");
for (int i = 0; i < chunk_index; i++) {
fprintf(current, "%d\n", chunk[i]);
}
fclose(current);
chunk_count++;
// Merge chunks
merge_chunks(chunks, chunk_count, "final_sorted.txt");
// Clean up temporary files
for (int i = 0; i < chunk_count; i++) {
remove(chunks[i]);
}
free(chunk);
}
In this example, the external_sort function takes in data and an optional chunk_size parameter. It then breaks the data down into chunks, sorts them, writes them to disk, and then merges them back together to form the final sorted data set. The function also uses a temporary directory to store the sorted chunks and cleans it up after the sorting is done.
It's important to note that this is a simplified example and the real-world scenarios can be more complex, you may need to handle the case when the data doesn't fit in memory, and the way to access the chunks of data on disk can vary depending on the language, framework and the system you are using.