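c - fgets returns fewer characters

Tags: c linux fgets

I am writing an assembly program for practice. The assembly program uses C library functions, and I am particularly interested in the fgets() function. The fgets manual page states: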

fgets()  reads in at most one less than size characters from stream and
stores them into the buffer pointed to by s.  Reading  stops  after  an
EOF  or a newline.  If a newline is read, it is stored into the buffer.
A terminating null byte ('\0') is stored after the  last  character  in
the buffer.

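I have declared a 1024-byte buffer and pass it to fgets to read text from a file, but the program returns 1019 characters. It always seems to return 5 fewer characters than the buffer size, so with a 1029-byte buffer it does return 1024 characters. Is this how fgets is meant to work, or is the problem in my code? My program is as follows: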

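#include <stdio.h>

int main(void)
{
    char buff[1024];

    FILE *fp = fopen("test.txt", "r");
    if (fp == NULL) {
        perror("test.txt");
        return 1;
    }

    /* Reads at most sizeof buff - 1 bytes, stopping early at a newline or EOF. */
    if (fgets(buff, sizeof buff, fp) == NULL)
        buff[0] = '\0';   /* nothing was read; keep buff a valid empty string */

    FILE *fp2 = fopen("outputtest.txt", "w");
    if (fp2 == NULL) {
        perror("outputtest.txt");
        fclose(fp);
        return 1;
    }
    fputs(buff, fp2);

    fclose(fp);
    fclose(fp2);
    return 0;
}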

The input does not contain a null byte or a newline at position 1020, so fgets should return up to 1023 characters. Here is the input:

this a test file. The development of Linux is one of the most prominent examples of free and open-source software collaboration. The underlying source code may be used, modified and distributed—commercially or non-commercially—by anyone under the terms of its respective licenses, such as the GNU General Public License. Typically, Linux is packaged in a form known as a Linux distribution, for both desktop and server use. Some of the popular mainstream Linux distributions are Debian, Ubuntu, Linux Mint, Fedora, openSUSE, Arch Linux and Gentoo, together with commercial Red Hat Enterprise Linux and SUSE Linux Enterprise Server distributions. Linux distributions include the Linux kernel, supporting utilities and libraries, and usually a large amount of application software to fulfill the distribution's intended use. Distributions oriented toward desktop use typically include X11, a Wayland implementation or Mir as the windowing system, and an accompanying desktop environment such as GNOME or the KDE Software Compilation; some distributions may also include a less resource-intensive desktop such as LXDE or Xfce. Distributions intended to run on servers may omit all graphical environments from the standard install, and instead include other software to set up and operate a solution stack such as LAMP. Because Linux is freely redistributable, anyone may create a distribution for any intended use.

The output is as follows:

this a test file. The development of Linux is one of the most prominent examples of free and open-source software collaboration. The underlying source code may be used, modified and distributed—commercially or non-commercially—by anyone under the terms of its respective licenses, such as the GNU General Public License. Typically, Linux is packaged in a form known as a Linux distribution, for both desktop and server use. Some of the popular mainstream Linux distributions are Debian, Ubuntu, Linux Mint, Fedora, openSUSE, Arch Linux and Gentoo, together with commercial Red Hat Enterprise Linux and SUSE Linux Enterprise Server distributions. Linux distributions include the Linux kernel, supporting utilities and libraries, and usually a large amount of application software to fulfill the distribution's intended use. Distributions oriented toward desktop use typically include X11, a Wayland implementation or Mir as the windowing system, and an accompanying desktop environment such as GNOME or the KDE Software

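The output above ends with a space, which makes up the full 1019 characters returned. I would like to know what causes this. My assembly program works, but the number of characters read is not what I expect. Can someone explain why this happens?

Thanks in advance.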

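Best answer

Converting comments into an answer.

Running on Mac OS X, your code generates a 1023-byte output file according to ls -l. But, just as you saw, my output file ends after "KDE Software " (with a trailing blank). How are you determining the size of the output file? How sure are you of your counts? Does the problem occur with shorter buffer sizes (say 32 bytes), that is, is the output 5 bytes shorter than you think it should be?

Then rici correctly noted: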

It is surely relevant that the sample text includes two instances of U+2014 EM DASH (—), whose UTF-8 encoding is e2 80 94.

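That is extremely likely, even certain. It explains why vim seemed to put the cursor in the wrong place when I used 1024| (it counts characters rather than bytes), which confused me. When I run wc -m on a Mac, I get 1019 (multi-byte) characters, but still 1023 bytes.

user1803784 observed: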

I used atom.io text editor to get the count and the error start occurring at 256 bytes. I tried 128 bytes, 64 bytes, 32 bytes and the error does not occur it returns 127 bytes, 63 bytes, 31 bytes respectively (as the manual page stated "at most one less than size characters from stream").

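Since the first em dash appears at offset 194, it looks as though your problem is entirely a matter of bytes versus characters, combined with the fact that your data is UTF-8 encoded. Treated as a plain stream of non-zero (non-NUL) bytes, at most 1023 bytes can be read into buff, and that is exactly what your code does. However, if you count characters rather than bytes, you have two 3-byte characters (the two em dashes), each of which contributes one character but three bytes, so your character count is 4 less than the byte count: 1023 - 4 = 1019. You have just learned that your editor counts characters, whereas programs such as ls report bytes. In general, the two numbers differ.

We can also observe that the "characters" the quoted manual page refers to are char-type characters, also known as "bytes" (on most systems; on some machines char is not an 8-bit byte). The confusion stems partly from the C standard.

ISO/IEC 9899:2011 §7.21.7.2 The fgets function says: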

The fgets function reads at most one less than the number of characters specified by n from the stream pointed to by stream into the array pointed to by s. No additional characters are read after a new-line character (which is retained) or after end-of-file. A null character is written immediately after the last character read into the array.

Italic emphasis added.

By contrast, the POSIX specification of fgets() is expressed in terms of bytes:

The fgets() function shall read bytes from stream into the array pointed to by s, until n-1 bytes are read, or a <newline> is read and transferred to s, or an end-of-file condition is encountered. The string is then terminated with a null byte.

Italic emphasis added.

The POSIX page notes:

The functionality described on this reference page is aligned with the ISO C standard. Any conflict between the requirements described here and the ISO C standard is unintentional. This volume of POSIX.1-2008 defers to the ISO C standard.

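This refers to ISO/IEC 9899:1999, because POSIX.1-2008 was published before C11, but the wording in C99 §7.19.7.2 is the same as in C11. Arguably, the POSIX wording is easier to understand than the C standard wording. However, the definitions section of the standard says: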

3.7
1 character
〈abstract〉 member of a set of elements used for the organization, control, or representation of data

3.7.1
1 character
single-byte character
〈C〉 bit representation that fits in a byte

3.7.2
1 multibyte character
sequence of one or more bytes representing a member of the extended character set of either the source or the execution environment
2 NOTE The extended character set is a superset of the basic character set.

3.7.3
1 wide character
value representable by an object of type wchar_t, capable of representing any character in the current locale

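So, in this context, "character" means what most people think of as a "byte" (bearing in mind that not all machines have CHAR_BIT == 8).

Regarding "c - fgets returns fewer characters", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/32663407/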
