python - How do I find the specific row that raises an error when saving a Graphlab SFrame?

Tags: python csv dataframe graphlab sframe

I have an SFrame that looks like this when printed with sf.print_rows(10):

+--------------+---------------+-------+-------------------------------+
|   Dataset    |     Domain    | Score |             Sent1             |
+--------------+---------------+-------+-------------------------------+
| STS2012-gold | surprise.OnWN |  5.0  | render one language in ano... |
| STS2012-gold | surprise.OnWN |  3.25 | nations unified by shared ... |
| STS2012-gold | surprise.OnWN |  3.25 | convert into absorbable su... |
| STS2012-gold | surprise.OnWN |  4.0  | devote or adapt exclusivel... |
| STS2012-gold | surprise.OnWN |  3.25 | elevated wooden porch of a... |
| STS2012-gold | surprise.OnWN |  4.0  | either half of an archery bow |
| STS2012-gold | surprise.OnWN | 3.333 | a removable device that is... |
| STS2012-gold | surprise.OnWN |  4.75 |      restrict or confine      |
| STS2012-gold | surprise.OnWN |  0.5  |     orient, be positioned     |
| STS2012-gold | surprise.OnWN |  4.75 | Bring back to life, return... |
+--------------+---------------+-------+-------------------------------+
+-------------------------------+-------------------------------+
|             Sent2             |        Sent1_tokenized        |
+-------------------------------+-------------------------------+
| restate (words) from one l... | [render, one, language, in... |
| a group of nations having ... | [nations, unified, by, sha... |
| soften or disintegrate by ... | [convert, into, absorbable... |
| devote oneself to a specia... | [devote, or, adapt, exclus... |
| a porch that resembles the... | [elevated, wooden, porch, ... |
| either of the two halves o... | [either, half, of, an, arc... |
| a supplementary part or ac... | [a, removable, device, tha... |
| place limits on (extent or... |    [restrict, or, confine]    |
|          be opposite.         |   [orient,, be, positioned]   |
|  cause to become alive again. | [Bring, back, to, life,, r... |
+-------------------------------+-------------------------------+
+-------------------------------+-----------+-----------+----------------------+
|        Sent2_tokenized        | Sent1_len | Sent2_len | NGRAM-cosChar2ngrams |
+-------------------------------+-----------+-----------+----------------------+
| [restate, (words), from, o... |     6     |     8     |      0.82090085      |
| [a, group, of, nations, ha... |     8     |     7     |      0.53250804      |
| [soften, or, disintegrate,... |     11    |     11    |      0.43274232      |
| [devote, oneself, to, a, s... |     10    |     8     |      0.47759567      |
| [a, porch, that, resembles... |     6     |     9     |      0.38885689      |
| [either, of, the, two, hal... |     6     |     12    |      0.55555556      |
| [a, supplementary, part, o... |     10    |     5     |      0.44963552      |
| [place, limits, on, (exten... |     3     |     6     |      0.27124449      |
|        [be, opposite.]        |     3     |     2     |      0.43528575      |
| [cause, to, become, alive,... |     8     |     5     |      0.37047929      |
+-------------------------------+-----------+-----------+----------------------+
+----------------------+----------------------+----------------------+
| NGRAM-cosChar3ngrams | NGRAM-cosChar4ngrams | NGRAM-cosChar5ngrams |
+----------------------+----------------------+----------------------+
|      0.74964917      |      0.71490469      |      0.67925959      |
|      0.36701702      |      0.28941438      |      0.23635427      |
|      0.25899951      |      0.21053227      |      0.17058877      |
|      0.26248718      |      0.20518234      |      0.14285714      |
|      0.17107978      |      0.12049505      |      0.09320546      |
|      0.40754381      |      0.24715577      |      0.11547005      |
|      0.21997067      |      0.17554945      |      0.15450786      |
|      0.13284223      |      0.09284767      |       0.048795       |
|      0.31426968      |      0.17149859      |      0.09449112      |
|      0.0632772       |      0.03402069      |         0.0          |
+----------------------+----------------------+----------------------+
+---------------------+---------------------+---------------------+---------------------+

[19097 rows x 134 columns]

But when I try to save it to CSV with sf.save('trainers.csv', format='csv'), it throws an error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-23-f82bcb3fa197> in <module>()
----> 1 sts.save('trainers.csv', format='csv')

/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sframe.pyc in save(self, filename, format)
   2924                 self.export_json(url)
   2925             else:
-> 2926                 raise ValueError("Unsupported format: {}".format(format))
   2927 
   2928     def export_csv(self, filename, delimiter=',', line_terminator='\n',

/usr/local/lib/python2.7/dist-packages/graphlab/cython/context.pyc in __exit__(self, exc_type, exc_value, traceback)
     47             if not self.show_cython_trace:
     48                 # To hide cython trace, we re-raise from here
---> 49                 raise exc_type(exc_value)
     50             else:
     51                 # To show the full trace, we do nothing and let exception propagate

RuntimeError: Runtime Exception. Traceback (most recent call last):
  File "<ipython-input-5-e29b4d4eba06>", line 20, in <lambda>
ZeroDivisionError: division by zero

I printed out n rows at a time, e.g. sf.print_rows(10) and sf.print_rows(100), and at sf.print_rows(129) it throws the error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-24-13550768dbcd> in <module>()
----> 1 sts.print_rows(129)

/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sframe.pyc in print_rows(self, num_rows, num_columns, max_column_width, max_row_width, output_file)
   2226         max_row_width = max(max_row_width, max_column_width + 1)
   2227 
-> 2228         printed_sf = self._imagecols_to_stringcols(num_rows)
   2229         row_of_tables = printed_sf.__get_pretty_tables__(wrap_text=False,
   2230                                                          max_rows_to_display=num_rows,

/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sframe.pyc in _imagecols_to_stringcols(self, num_rows)
   2250                 if t in image_column_names:
   2251                     printed_sf[t] = self[t].astype(str)
-> 2252         return printed_sf.head(num_rows)
   2253 
   2254     def __str_impl__(self, num_rows=10, footer=True):

/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sframe.pyc in head(self, n)
   2454         tail, print_rows
   2455         """
-> 2456         return SFrame(_proxy=self.__proxy__.head(n))
   2457 
   2458     def to_dataframe(self):

graphlab/cython/cy_sframe.pyx in graphlab.cython.cy_sframe.UnitySFrameProxy.head()

graphlab/cython/cy_sframe.pyx in graphlab.cython.cy_sframe.UnitySFrameProxy.head()

RuntimeError: Runtime Exception. Traceback (most recent call last):
  File "<ipython-input-5-e29b4d4eba06>", line 20, in <lambda>
ZeroDivisionError: division by zero

So I tried sf.fillna(c, 0) on every column:

for c in sts.column_names():
    sts = sts.fillna(c, 0)

It throws a different error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-26-e63cf73308dd> in <module>()
      1 for c in sts.column_names():
----> 2     sts = sts.fillna(c, 0)

/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sframe.pyc in fillna(self, column, value)
   5652             raise TypeError("Must give column name as a str")
   5653         ret = self[self.column_names()]
-> 5654         ret[column] = ret[column].fillna(value)
   5655         return ret
   5656 

/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sarray.pyc in fillna(self, value)
   2439 
   2440         with cython_context():
-> 2441             return SArray(_proxy = self.__proxy__.fill_missing_values(value))
   2442 
   2443     def topk_index(self, topk=10, reverse=False):

/usr/local/lib/python2.7/dist-packages/graphlab/cython/context.pyc in __exit__(self, exc_type, exc_value, traceback)
     47             if not self.show_cython_trace:
     48                 # To hide cython trace, we re-raise from here
---> 49                 raise exc_type(exc_value)
     50             else:
     51                 # To show the full trace, we do nothing and let exception propagate

RuntimeError: Runtime Exception. Default value must be convertible to column type

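How do I find the specific row that raises the error when saving a Graphlab SFrame?

How can I fix that row? Can I use fillna() to replace the problematic columns in the row? I can't really discard these rows with dropna(), because I need to keep track of the problematic rows.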

But even with dropna(), I still get the same error when saving:

sf.dropna()
sf.save('trainers.csv', format='csv')

How do I find the rows that give me errors or ZeroDivisionErrors? And how can I correct them or fill those columns with zeros?

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-28-f82bcb3fa197> in <module>()
----> 1 sts.save('trainers.csv', format='csv')

/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sframe.pyc in save(self, filename, format)
   2924                 self.export_json(url)
   2925             else:
-> 2926                 raise ValueError("Unsupported format: {}".format(format))
   2927 
   2928     def export_csv(self, filename, delimiter=',', line_terminator='\n',

/usr/local/lib/python2.7/dist-packages/graphlab/cython/context.pyc in __exit__(self, exc_type, exc_value, traceback)
     47             if not self.show_cython_trace:
     48                 # To hide cython trace, we re-raise from here
---> 49                 raise exc_type(exc_value)
     50             else:
     51                 # To show the full trace, we do nothing and let exception propagate

RuntimeError: Runtime Exception. Traceback (most recent call last):
  File "<ipython-input-5-e29b4d4eba06>", line 20, in <lambda>
ZeroDivisionError: division by zero

Strangely, I cannot iterate over the SFrame when I try the following:

for i in sf:
    print i

It throws this error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-29-d2d0035d7bbe> in <module>()
----> 1 for i in sts:
      2     print i

/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sframe.pyc in generator()
   3712         def generator():
   3713             elems_at_a_time = 262144
-> 3714             self.__proxy__.begin_iterator()
   3715             ret = self.__proxy__.iterator_get_next(elems_at_a_time)
   3716             column_names = self.column_names()

graphlab/cython/cy_sframe.pyx in graphlab.cython.cy_sframe.UnitySFrameProxy.begin_iterator()

graphlab/cython/cy_sframe.pyx in graphlab.cython.cy_sframe.UnitySFrameProxy.begin_iterator()

RuntimeError: Runtime Exception. Traceback (most recent call last):
  File "<ipython-input-5-e29b4d4eba06>", line 10, in <lambda>
TypeError: 'NoneType' object is not iterable

It gets even stranger: I cannot retrieve a specific row with sf[num], but I can take a sub-SFrame and then retrieve row num from it. So this:

print sf[25]

breaks and throws:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-62-6bc8898704c0> in <module>()
----> 1 print sts[25]

/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sframe.pyc in __getitem__(self, key)
   3595             ub = min(sf_len, lb + block_size)
   3596 
-> 3597             val_list = list(SFrame(_proxy = self.__proxy__.copy_range(lb, 1, ub)))
   3598             self._cache["getitem_cache"] = (lb, ub, val_list)
   3599             return val_list[key - lb]

graphlab/cython/cy_sframe.pyx in graphlab.cython.cy_sframe.UnitySFrameProxy.copy_range()

graphlab/cython/cy_sframe.pyx in graphlab.cython.cy_sframe.UnitySFrameProxy.copy_range()

RuntimeError: Runtime Exception. Traceback (most recent call last):
  File "<ipython-input-5-e29b4d4eba06>", line 10, in <lambda>
TypeError: 'NoneType' object is not iterable

But when I extract a subset and then print, it works. The code below retrieves element 25, the same one that raised the error with the code above:

x =  sf[:30]
print x[25]

Is there a reason why the earlier sf[25] code throws the NoneType error? sf[0] through sf[24] work, but any index from 25 upward does not.

Apparently, iterating over the SFrame in chunks like this and dumping each row as a str sort of works:

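fout = open('superbad.txt', 'w')
sflen = len(sf)
i = 0
while i < sflen:
    # Take the next chunk of at most 100 rows.
    m = i+100 if i+100 < sflen else sflen
    x = sf[i:m]
    for j in x:
        fout.write(str(j) +'\n\n')
    i = m  # advance to the next chunk; without this the loop never ends
fout.close()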

This is strange. Why does iterating in chunks and dumping to strings work?

Best answer

The problem is a division-by-zero error raised while running an apply (somewhere before the save):

RuntimeError: Runtime Exception. Traceback (most recent call last):
File "<ipython-input-5-e29b4d4eba06>", line 20, in <lambda>
ZeroDivisionError: division by zero

This happens because of lazy evaluation ( https://en.wikipedia.org/wiki/Lazy_evaluation ). As an example, suppose I start with an SFrame that has a single column:

import graphlab as gl
sf = gl.SFrame({'x': range(10000, -1, -1)})  # last row holds x = 0
sf['y'] = sf['x'].apply(lambda x: 1.0/x)  # lazy: no error raised yet

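At this point, the last row of the SFrame holds the value 1.0/0, which is an error, but it has not been evaluated yet. The save method triggers materialization, that is, the actual computation of every row in the data, and that is when the error occurs. You can trigger this process yourself by calling __materialize__:
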
sf.__materialize__()

This produces the following error:

RuntimeError: Runtime Exception. Traceback (most recent call last):
File "<ipython-input-55-5af90e232e2d>", line 1, in <lambda>
ZeroDivisionError: float division by zero
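
The same delayed failure can be reproduced without GraphLab. The sketch below uses a plain Python generator as a stand-in for a lazy column; `lazy_apply` is a hypothetical helper written for this illustration, not part of any SFrame API. Building the pipeline raises nothing, and the error only appears once the values are consumed.

```python
def lazy_apply(values, fn):
    """Return a generator that applies fn only when a value is requested."""
    return (fn(v) for v in values)

xs = list(range(3, -1, -1))  # [3, 2, 1, 0]; the last value is 0
pending = lazy_apply(xs, lambda x: 1.0 / x)  # no error yet: nothing has run

computed = []
error = None
try:
    for value in pending:  # consuming the generator plays the role of materialization
        computed.append(value)
except ZeroDivisionError as exc:
    error = exc

print(len(computed), repr(error))
```

The first three values are computed fine, and the failure is reported far from the line that defined the lambda, which is the same pattern seen in the tracebacks in the question.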

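Lazy evaluation and query planning are important performance optimizations and one of the reasons SFrame is fast and scalable. Unfortunately, harder error tracking is one of its annoyances, but once you understand how it works you get used to it.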
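
Since the error only surfaces at materialization, one defensive option is to make the function passed to apply handle bad input itself. The sketch below is a minimal example of that idea; `safe_ratio` and the keys `a` and `b` are hypothetical names chosen for illustration, since the lambda that actually fails in the question is not shown.

```python
def safe_ratio(numerator, denominator, default=0.0):
    """Divide, returning `default` instead of raising on a zero or missing operand."""
    if numerator is None or not denominator:
        return default
    return float(numerator) / denominator

# Plain dicts stand in for SFrame rows here.
rows = [{'a': 1, 'b': 2}, {'a': 3, 'b': 0}, {'a': 5, 'b': None}]
ratios = [safe_ratio(r['a'], r['b']) for r in rows]
print(ratios)
```

With an SFrame, the same function could presumably be used as `sf.apply(lambda row: safe_ratio(row['a'], row['b']))`. Passing `default=None` instead of zero would keep the problematic rows identifiable as missing values rather than silently turning them into zeros.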

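The head() function does not trigger a full materialization, so you can call it on as many rows as you like, increasing the count until you hit the error.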
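
Rather than increasing the row count by hand, the search can be done by bisection. The sketch below is written against a generic `probe(n)` callable so it runs without GraphLab; `first_failing_count` is a hypothetical helper, and using something like `lambda n: sf.head(n)` as the probe is an assumption based on the tracebacks in the question, where head() is the call that raises. It also assumes that once a count fails, every larger count fails too. If GraphLab evaluates rows in blocks, the result may only narrow the problem down to a block rather than a single row.

```python
def first_failing_count(probe, n_rows):
    """Return the smallest n in 1..n_rows for which probe(n) raises, or None.

    Assumes that once probe(n) fails, it also fails for every larger n.
    """
    def fails(n):
        try:
            probe(n)
            return False
        except Exception:
            return True

    if not fails(n_rows):
        return None  # evaluating every row succeeds, so there is nothing to find
    lo, hi = 1, n_rows  # invariant: fails(hi) is True
    while lo < hi:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo

# Stand-in for a lazy column: the row at 0-based index 41 divides by zero.
data = [1] * 41 + [0] + [1] * 58

def probe(n):
    # Hypothetical SFrame equivalent: sf.head(n)
    for x in data[:n]:
        1.0 / x

bad_count = first_failing_count(probe, len(data))
print(bad_count, bad_count - 1)  # failing count and 0-based index of the bad row
```

Each probe re-evaluates the first n rows, so the search costs a logarithmic number of partial evaluations instead of one per candidate count.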

Source: Stack Overflow, https://stackoverflow.com/questions/34654901/
