arrays - 用于共享数组子范围的智能垃圾收集器？

在 this popular question about why substring takes O(n) in C# ，提供的主要答案之一认为，如果分配了一个大数组并且通过让新字符串仅引用数组的一小部分来计算子字符串，那么垃圾收集器将无法回收包含更大字符串的字符数组如果不再引用原始字符串。

这似乎是一个完全有效的答案，但在理论上似乎可以为数组构建一个垃圾收集器，允许对数组的大部分进行垃圾收集，同时留下一些仍在使用的小子数组。换句话说，如果有一个 50,000 元素的数组，其中只有一个小的 100 元素切片仍在使用，垃圾收集器可以将数组分成三部分 - 100 元素切片之前的元素，100 元素切片之前的元素切片本身，以及 100 元素切片之后的元素 - 然后垃圾收集这些片段中的第一个和最后一个。

我的问题是任何语言实现是否真的使用这种垃圾收集器，或者它是否仅存在于理论上。有谁知道有这样一个垃圾收集器的语言实现的例子？

最佳答案

理论上，是的……这是可能的。但是GC有一个问题:
收集垃圾，它需要知道数据的布局
存储在内存中，并且它还必须存储数据以指示是否
内存块是否在使用中...但布局信息是共享的
使用运行时，因为运行时需要知道对象类型
(即内存布局)以进行类型转换。

GC 是如何工作的？

GC 开始读取它知道的根对象。它获取所有引用
并将它们标记为正在使用。对于这些引用的对象中的每一个，它得到
布局并从论文中阅读更多引用资料，并标记它们
在使用中......并且这个过程继续，直到没有更多的引用。

注:我使用了类型信息和布局信息，含义相同。

例子:

Imagine we have some object layouts:
====================================
A: { int, object, double, object }
B: { object, object }
C: { int }


Memory Data:
============
Now we need to describe what is in memory.
The following list is composed of memory positions,
and the data contained in each position.
There are 2 kinds of notation I will use:
  - Address: ROOT[ list-of-root-references ]
  - Address: type-identifier { object-data }
     Note that each object can span multiple memory
     positions from the first one.
     e.g. 90: B { 0, 0 } -- this reads as
       "stored at address 90: type-identifier of B followed by raw data { 0, 0 }"
  - A reference is represented by a number,
    that point to the memory address of the pointed object.

 1: ROOT[ 90, 20, 30 ]
20: A { 1236, 30, 0.5, 60 }
30: B { 60, 70 }
40: C { 1237 }
50: A { 1238, 20, 0.8, 50 }    There is a self-reference here!
60: C { 1234 }
70: A { 1234, 0, 0.7, 80 }
80: C { 1235 }
90: B { 0, 0 }                 Note that 0 is a null reference!

   The GC need to know the layout of each object.
   Otherwise it wouldn't be abled to tell
   what knid of information is stored in each memory position.

Running the GC:
===============
Garbage collecting steps, to clean the memory described above:
Step 1: Get ROOT references: 2, 3, 9 and mark as 'in-use'
Step 2: Get references from 2, 3 and 9: 3, 6, 7. Note that 0 is a null reference!
Step 3: Get references from 3, 6 and 7: 6, 7, 8, and mark them.
Step 4: Get references from 6, 7, 8, and mark them: only 8!
Step 5: Get references from 8... it has none! We are finished with marking objects.
Step 6: Delete unmarked objects.


      This shows what happened in each step with each object stored in the memory.

Step ->  1  2  3  4  5
Memory                    
20       x           
30       x  x        
40                   DELETE
50                   DELETE
60          x  x     
70          x  x     
80             x  x  
90       x

我描述的是一个非常基本的 GC 算法。

看看三色标记……真是太棒了!
这就是真正的现代 GC 的制作方式。

Garbage collection (computer science) - 描述了一些现代 GC 方法。

但是......关于布局的信息存储在哪里？

这个问题很重要，因为它会影响 GC 和运行时。
要进行快速类型转换，类型信息必须放在引用附近，
或在分配的内存附近。我们可以考虑存储类型信息
在分配的内存块的目录中，但是...类型转换需要
访问目录，就像 new 运算符和 GC 需要时一样
删除对象。

如果我们将布局信息存储在引用附近，那么每个引用
同一个对象将重复相同的信息，以及指针
本身。

例子:

To represent the memory data I will introduce the following notation:
 - Address: { object-data } -- this time object type is not at the begining!
 - A reference is represented by a type-identifier and an address-number,
   that point to the memory address of the pointed object,
   in the following format: type/number...
   e.g. A/20 -- this reads as: "at address 20, there is an object of type A"
Note that: 0/0 is a null reference,
      but it still requires space to store the type.

The memory would look like this:
 1: ROOT[ B/90, A/20, B/30 ]
20: { 1236, B/30, 0.5, C/60 }
30: { C/60, A/70 }
40: { 1237 }
50: { 1238, A/20, 0.8, A/50 }
60: { 1234 }
70: { 1234, 0/0, 0.7, C/80 }
80: { 1235 }
90: { 0/0, 0/0 }

如果我们将布局信息存储在分配的内存块附近，那就太好了!
它速度快，并且避免了重复的布局信息。

例子:

The memory looks like the first sample:
 *This is the same notation used at the begining of this answer.
 1: ROOT[ 90, 20, 30 ]
20: A { 1236, 30, 0.5, 60 }
30: B { 60, 70 }
40: C { 1237 }
50: A { 1238, 20, 0.8, 50 }
60: C { 1234 }
70: A { 1234, 0, 0.7, 80 }
80: C { 1235 }
90: B { 0, 0 }

到目前为止，一切都很好......但现在我想要共享内存。

我们注意到的第一件事是我们不能将布局信息存储在
分配内存了。

想象一个具有共享内存的数组:

例子:

 I'll introduce a new notation for arrays:
    type-identifier < array-length >

 1: ROOT[ 20, 31 ] -- pointer 31 is invalid, destination has no type-identifier.
20: INT<5>  -- info about layout of the next memory data (spans by 10 memory units)
30:  2
31:  3 -- should have type-identifier, because someone
32:  5              is pointing here!! Invalid memory!!
33:  7
34: 11

我们仍然可以尝试将布局信息放在指针旁边，
反而。现在可以使用共享内存数组:

例子:

 1: ROOT[ INT<5>/30, INT<2>/31 ]
30:  2
31:  3
32:  5
33:  7
34: 11

请记住，这种方法使我们在各处重复布局信息......
但这里的重点是使用更少的内存不是吗？？？但是要共享内存，我们需要
更多内存来存储布局数据/指针。我们还没有 donut 。 =(

只有一个希望:让我们降低运行时特性!

这是我的答案 - 我认为这是可能的 =>

使用内存分配目录来存储类型信息怎么样？

这可以做到，但是动态转换会受到影响，GC 也会受到影响。
记得我说过 GC 需要访问内存目录，只是为了删除对象......
好吧，现在每次找到引用时都需要访问目录，
不仅仅是删除。我的天啊!!我们即将用这个来杀死 GC 性能，以及运行时性能。我觉得成本太高了!

<= 这是我的答案

但是......如果运行时不支持动态转换？如果编译器知道
编译时关于类型的一切......然后GC甚至不存在......它需要
关于类型的信息，因为这些信息告诉它什么是布局
该类型使用的内存。

看不到简单、智能的解决方案。

也许我完全错了。但我想不出比现在更好的方法了。
现代 GC 比这更复杂......我在这里只介绍了基础知识，
我认为，现代 GC 正在以其他方式进行优化，即其他更可靠的方式。

其他引用:

http://en.wikipedia.org/wiki/Garbage_collection_(computer_science)

http://www.memorymanagement.org/glossary/t.html

http://www.cs.purdue.edu/homes/sbarakat/cs456/gc-2.pdf

Tri-Color Incremental Updating GC: Does it need to scan each stack twice?

关于arrays - 用于共享数组子范围的智能垃圾收集器？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/6855808/

arrays - 用于共享数组子范围的智能垃圾收集器？

上一篇：从保存的检查点恢复训练模型时，Tensorflow 批量损失峰值？

下一篇：ruby-on-rails-3 - Mysql2 错误 MySQL 服务器已消失