大家好,我正在尝试读取 workersinfo.txt 并将其存储到二维字符数组中。该文件大约有 4,000,000 行,每行大约 100 个字符。我想将每个文件行存储在数组中。不幸的是,我得到退出代码 139(内存不足)。我知道我必须使用 malloc()
和 free()
但我已经尝试了一些东西,但我无法让它们工作。最终我必须按 ID 号对数组进行排序,但我坚持声明数组。
该文件看起来像这样:
First Name, Last Name,Age, ID
Carlos,Lopez,,10568
Brad, Patterson,,20586
Zack, Morris,42,05689
到目前为止,这是我的代码:
#include <stdio.h>
#include <stdlib.h>
int main(void) {
FILE *ptr_file;
char workers[4000000][1000];
ptr_file =fopen("workersinfo.txt","r");
if (!ptr_file)
perror("Error");
int i = 0;
while (fgets(workers[i],1000, ptr_file)!=NULL){
i++;
}
int n;
for(n = 0; n < 4000000; n++)
{
printf("%s", workers[n]);
}
fclose(ptr_file);
return 0;
}
最佳答案
堆栈内存是有限的。正如您在问题中指出的那样,您必须使用 malloc 来分配如此大的(需要我说巨大的)内存块,因为堆栈不能包含它。
您可以使用 ulimit
查看系统的限制(通常包括堆栈大小限制)。
在我的 Mac 上,限制是 8Mb。运行 ulimit -a
后,我得到:
...
stack size (kbytes, -s) 8192
...
或者,使用以下方法测试限制:
struct rlimit slim;
getrlimit(RLIMIT_STACK, &rlim);
rlim.rlim_cur // the stack limit
我真心建议您分别处理每个数据库条目。
如评论中所述,在大多数实现中,将内存分配为静态内存会绕过堆栈。
不过,恕我直言,分配 400MB 内存(或 4GB,具体取决于我查看的问题的哪一部分)是错误的形式,除非完全需要 - 特别是对于单个函数。
后续Q1:如何分别处理每个DB entry
我希望我没有在做你的家庭作业或其他任何事情......但我怀疑你的家庭作业会包括将 400Mb 数据加载到计算机内存的作业......所以......回答你评论中的问题:
下面的单项处理草图并不完美 - 每个条目的数据限制为 1Kb(我认为对于这种简单的数据来说已经足够了)。
此外,我不允许使用 UTF-8 编码或类似的编码(我遵循将使用英语的假设)。
从代码中可以看出,我们分别读取每一行并执行错误检查以检查数据是否有效。
要按 ID 对文件进行排序,您可以考虑一次运行两行(这将是一种缓慢的排序)并对它们进行排序,或者使用 ID 数据创建一个排序的 node
树和该行在文件中的位置(在读取该行之前获取位置)。一旦你对二叉树进行了排序,你就可以对数据进行排序......
...二叉树可能会变得有点大。你查了吗sorting algorithms ?
#include <stdio.h>
// assuming this is the file structure:
//
// First Name, Last Name,Age, ID
// Carlos,Lopez,,10568
// Brad, Patterson,,20586
// Zack, Morris,42,05689
//
// Then this might be your data structure per line:
struct DBEntry {
char* last_name; // a pointer to the last name
char* age; // a pointer to the name - could probably be an int
char* id; // a pointer to the ID
char first_name[1024]; // the actual buffer...
// I unified the first name and the buffer since the first name is first.
};
// each time you read only a single line, perform an error check for overflow
// and return the parsed data.
//
// return 1 on sucesss or 0 on failure.
int read_db_line(FILE* fp, struct DBEntry* line) {
if (!fgets(line->first_name, 1024, fp))
return 0;
// parse data and review for possible overflow.
// first, zero out data
int pos = 0;
line->age = NULL;
line->id = NULL;
line->last_name = NULL;
// read each byte, looking for the EOL marker and the ',' seperators
while (pos < 1024) {
if (line->first_name[pos] == ',') {
// we encountered a devider. we should handle it.
// if the ID feild's location is already known, we have an excess comma.
if (line->id) {
fprintf(stderr, "Parsing error, invalid data - too many fields.\n");
return 0;
}
// replace the comma with 0 (seperate the strings)
line->first_name[pos] = 0;
if (line->age)
line->id = line->first_name + pos + 1;
else if (line->last_name)
line->age = line->first_name + pos + 1;
else
line->last_name = line->first_name + pos + 1;
} else if (line->first_name[pos] == '\n') {
// we encountered a terminator. we should handle it.
if (line->id) {
// if we have the id string's possition (the start marker), this is a
// valid entry and we should process the data.
line->first_name[pos] = 0;
return 1;
} else {
// we reached an EOL without enough ',' seperators, this is an invalid
// line.
fprintf(stderr, "Parsing error, invalid data - not enough fields.\n");
return 0;
}
}
pos++;
}
// we ran through all the data but there was no EOL marker...
fprintf(stderr,
"Parsing error, invalid data (data overflow or data too large).\n");
return 0;
}
// the main program
int main(int argc, char const* argv[]) {
// open file
FILE* ptr_file;
ptr_file = fopen("workersinfo.txt", "r");
if (!ptr_file)
perror("File Error");
struct DBEntry line;
while (read_db_line(ptr_file, &line)) {
// do what you want with the data... print it?
printf(
"First name:\t%s\n"
"Last name:\t%s\n"
"Age:\t\t%s\n"
"ID:\t\t%s\n"
"--------\n",
line.first_name, line.last_name, line.age, line.id);
}
// close file
fclose(ptr_file);
return 0;
}
跟进 Q2:对 400MB-4GB 数据的数组进行排序
恕我直言,400MB 已经涉及到与大数据相关的问题。例如,就性能而言,在您的数据库上实现冒泡排序应该是令人痛苦的(除非它是单次任务,在这种情况下性能可能无关紧要)。
创建一个 DBEntry
对象数组最终会让您占用比实际数据更大的内存空间。
这不是是对大数据进行排序的最佳方式。
正确的方法取决于您的排序算法。 Wikipedia has a decent primer on sorting algorythms .
由于我们要处理大量数据,因此需要考虑以下几点:
划分工作是有意义的,因此不同的线程/进程对数据的不同部分进行排序。
我们需要尽量减少硬盘驱动器的 IO(因为这会显着降低排序速度并阻止在同一台机器/磁盘上进行并行处理)。
一种可能的做法是创建一个堆进行堆排序,但只存储一个优先级值并存储文件中的原始位置。
另一种选择可能是采用分而治之的算法,例如 quicksort ,同样,仅对计算出的排序值和条目在原始文件中的位置进行排序。
无论哪种方式,编写像样的排序方法都将是一项复杂的任务,可能涉及线程、 fork 、临时文件或其他技术。
这是一个简化的演示代码......它远未优化,但它演示了二叉排序树的想法,它保存排序值和数据在文件中的位置。
请注意,使用此代码会相对较慢(虽然不是那么慢)并且会占用大量内存...
另一方面,每个条目需要大约 24 个字节。对于 400 万个条目,它是 96MB,比 400Mb 好一些,绝对比 4GB 好。
#include <stdlib.h>
#include <stdio.h>
// assuming this is the file structure:
//
// First Name, Last Name,Age, ID
// Carlos,Lopez,,10568
// Brad, Patterson,,20586
// Zack, Morris,42,05689
//
// Then this might be your data structure per line:
struct DBEntry {
char* last_name; // a pointer to the last name
char* age; // a pointer to the name - could probably be an int
char* id; // a pointer to the ID
char first_name[1024]; // the actual buffer...
// I unified the first name and the buffer since the first name is first.
};
// this might be a sorting node for a sorted bin-tree:
struct SortNode {
struct SortNode* next; // a pointer to the next node
fpos_t position; // the DB entry's position in the file
long value; // The computed sorting value
}* top_sorting_node = NULL;
// this function will free all the memory used by the global Sorting tree
void clear_sort_heap(void) {
struct SortNode* node;
// as long as there is a first node...
while ((node = top_sorting_node)) {
// step forward.
top_sorting_node = top_sorting_node->next;
// free the original first node's memory
free(node);
}
}
// each time you read only a single line, perform an error check for overflow
// and return the parsed data.
//
// return 0 on sucesss or 1 on failure.
int read_db_line(FILE* fp, struct DBEntry* line) {
if (!fgets(line->first_name, 1024, fp))
return -1;
// parse data and review for possible overflow.
// first, zero out data
int pos = 0;
line->age = NULL;
line->id = NULL;
line->last_name = NULL;
// read each byte, looking for the EOL marker and the ',' seperators
while (pos < 1024) {
if (line->first_name[pos] == ',') {
// we encountered a devider. we should handle it.
// if the ID feild's location is already known, we have an excess comma.
if (line->id) {
fprintf(stderr, "Parsing error, invalid data - too many fields.\n");
clear_sort_heap();
exit(2);
}
// replace the comma with 0 (seperate the strings)
line->first_name[pos] = 0;
if (line->age)
line->id = line->first_name + pos + 1;
else if (line->last_name)
line->age = line->first_name + pos + 1;
else
line->last_name = line->first_name + pos + 1;
} else if (line->first_name[pos] == '\n') {
// we encountered a terminator. we should handle it.
if (line->id) {
// if we have the id string's possition (the start marker), this is a
// valid entry and we should process the data.
line->first_name[pos] = 0;
return 0;
} else {
// we reached an EOL without enough ',' seperators, this is an invalid
// line.
fprintf(stderr, "Parsing error, invalid data - not enough fields.\n");
clear_sort_heap();
exit(1);
}
}
pos++;
}
// we ran through all the data but there was no EOL marker...
fprintf(stderr,
"Parsing error, invalid data (data overflow or data too large).\n");
return 0;
}
// read and sort a single line from the database.
// return 0 if there was no data to sort. return 1 if data was read and sorted.
int sort_line(FILE* fp) {
// allocate the memory for the node - use calloc for zero-out data
struct SortNode* node = calloc(sizeof(*node), 1);
// store the position on file
fgetpos(fp, &node->position);
// use a stack allocated DBEntry for processing
struct DBEntry line;
// check that the read succeeded (read_db_line will return -1 on error)
if (read_db_line(fp, &line)) {
// free the node's memory
free(node);
// return no data (0)
return 0;
}
// compute sorting value - I'll assume all IDs are numbers up to long size.
sscanf(line.id, "%ld", &node->value);
// heap sort?
// This is a questionable sort algorythm... or a questionable implementation.
// Also, I'll be using pointers to pointers, so it might be a headache to read
// (it's a headache to write, too...) ;-)
struct SortNode** tmp = &top_sorting_node;
// move up the list until we encounter something we're smaller then us,
// OR untill the list is finished.
while (*tmp && (*tmp)->value <= node->value)
tmp = &((*tmp)->next);
// update the node's `next` value.
node->next = *tmp;
// inject the new node into the tree at the position we found
*tmp = node;
// return 1 (data was read and sorted)
return 1;
}
// writes the next line in the sorting
int write_line(FILE* to, FILE* from) {
struct SortNode* node = top_sorting_node;
if (!node) // are we done? top_sorting_node == NULL ?
return 0; // return 0 - no data to write
// step top_sorting_node forward
top_sorting_node = top_sorting_node->next;
// read data from one file to the other
fsetpos(from, &node->position);
char* buffer = NULL;
ssize_t length;
size_t buff_size = 0;
length = getline(&buffer, &buff_size, from);
if (length <= 0) {
perror("Line Copy Error - Couldn't read data");
return 0;
}
fwrite(buffer, 1, length, to);
free(buffer); // getline allocates memory that we're incharge of freeing.
return 1;
}
// the main program
int main(int argc, char const* argv[]) {
// open file
FILE *fp_read, *fp_write;
fp_read = fopen("workersinfo.txt", "r");
fp_write = fopen("sorted_workersinfo.txt", "w+");
if (!fp_read) {
perror("File Error");
goto cleanup;
}
if (!fp_write) {
perror("File Error");
goto cleanup;
}
printf("\nSorting");
while (sort_line(fp_read))
printf(".");
// write all sorted data to a new file
printf("\n\nWriting sorted data");
while (write_line(fp_write, fp_read))
printf(".");
// clean up - close files and make sure the sorting tree is cleared
cleanup:
printf("\n");
fclose(fp_read);
fclose(fp_write);
clear_sort_heap();
return 0;
}
关于c - 二维字符数组太大退出代码 139,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/35930885/