A small pitfall in pandas.DataFrame.copy()

I recently ran into a small pitfall with pandas.DataFrame.copy(), so here is a quick note.

  • The pandas version I am using:
import pandas as pd
pd._version.get_versions()

"""
{'dirty': False,
 'error': None,
 'full-revisionid': 'a00154dcfe5057cb3fd86653172e74b6893e337d',
 'version': '0.22.0'}
"""
  • The ordinary case:
a = pd.DataFrame([[1]])
a.loc[0, 0] # 1

b = a.copy()  # deep=True is the default
b.loc[0,0] = 2
b.loc[0, 0] # 2
a.loc[0, 0] # 1
  • One step further, with a list stored in the cell:
a = pd.DataFrame([[[1]]])
a.loc[0, 0] # [1]

b = a.copy()  # deep=True is the default
b.loc[0,0] = [2]
b.loc[0, 0] # [2]
a.loc[0, 0] # [1]
  • And further still, mutating the list in place:
a = pd.DataFrame([[[1]]])
a.loc[0, 0] # [1]

b = a.copy()  # deep=True is the default
b.loc[0,0][0] = 2
b.loc[0, 0] # [2]
a.loc[0, 0] # [2]

Yes, the last line above is not a typo: it really is [2].

Thinking about why this happens, my guess is that even with deep=True, the copy does not deep-copy the list stored inside the cell.
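
The same behaviour can be reproduced without pandas. The following is a minimal sketch, assuming (this is my understanding, not something stated above) that a column holding lists is backed by an object-dtype NumPy array; the names `inner`, `original` and `duplicate` are only for illustration. Copying such an array duplicates the references it holds, not the objects they point to.

```python
import numpy as np

# An object-dtype array stores references to Python objects.
inner = [1]
original = np.empty(1, dtype=object)
original[0] = inner

# .copy() creates a new array, but each slot still refers to the same list.
duplicate = original.copy()

print(duplicate is original)        # False: the arrays are distinct
print(duplicate[0] is original[0])  # True: the list inside is shared

# Mutating the list through the copy is visible through the original.
duplicate[0][0] = 2
print(original[0])                  # [2]
```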

  • Other findings:
a = pd.DataFrame([[1]])
print(a.loc[0][0])

b = a.copy()
b.loc[0,0] = 2
print(b.loc[0, 0])  # 2
print(a.loc[0, 0])  # 1

print(a.loc[0,0] is b.loc[0,0]) # False
print(id(a.loc[0,0]) == id(b.loc[0,0])) #True

# ------

a = pd.DataFrame([[[1]]])
print(a.loc[0][0])

b = a.copy()
b.loc[0,0][0] = 2
print(b.loc[0, 0]) # [2]
print(a.loc[0, 0]) # [2]

print(a.loc[0,0] is b.loc[0,0])  # True
print(id(a.loc[0,0]) == id(b.loc[0,0])) # True

Strange…


18.8.10:

After finding the problem I opened an issue on GitHub, and a few days ago I received a reply from a pandas contributor:


18.8.13:

I then got a reply from another developer. He first pointed out that the official documentation for .copy(deep=True) already covers this:

When deep=True, data is copied but actual Python objects will not be copied recursively, only the reference to the object. This is in contrast to copy.deepcopy in the Standard Library, which recursively copies object data (see examples below).
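
The contrast the documentation draws can be seen with plain Python containers. This is a small sketch using only the standard library `copy` module; the variable names are made up for illustration. A shallow copy duplicates the outer container and keeps references to the inner objects, while `copy.deepcopy` copies the inner objects as well.

```python
import copy

original = [[1]]

shallow = copy.copy(original)   # new outer list, same inner list
deep = copy.deepcopy(original)  # new outer list and new inner list

print(shallow[0] is original[0])  # True: inner list is shared
print(deep[0] is original[0])     # False: inner list was copied

shallow[0][0] = 2
print(original)  # [[2]]: the change leaks through the shallow copy

deep[0][0] = 3
print(original)  # [[2]]: the deep copy is independent
print(deep)      # [[3]]
```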

He also explained what can happen when the following expression is evaluated:

id(a.loc[0,0]) == id(b.loc[0,0])

In his words,

the Python interpreter could perform the following steps:

  1. Evaluate a.loc[0, 0]; then
  2. Get the id of the temporary object created in step 1; then
  3. Evaluate b.loc[0, 0]; then
  4. Get the id of the temporary object created in step 3.

If the temporary object created in step 1 is GC’ed in between, the temporary object created in step 3 may be created at the same address. (In CPython, the id function returns the memory address of an object, although this is considered a CPython implementation detail.)

One can see examples of this just using plain old Python objects:

In [13]: id(object()), id(object())
Out[13]: (4763425312, 4763425312)

In [19]: print(object() is object())
False

In [20]: print(id(object()) == id(object()))
True
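
Following that explanation, comparing ids only makes sense while both objects are still alive. Here is a minimal sketch; `make_temp` is a hypothetical stand-in for an expression such as a.loc[0, 0] that builds a fresh object on every evaluation. Binding each result to a name keeps it alive, so the two ids cannot coincide.

```python
def make_temp():
    """Hypothetical stand-in for an expression that returns a new object each time."""
    return object()

# Keep both results referenced so neither can be garbage-collected early.
x = make_temp()
y = make_temp()

print(x is y)          # False: two distinct objects
print(id(x) == id(y))  # False: both are alive, so their ids must differ
```

In practice, `is` on values that are held in variables is the more reliable check.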

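Later I also tried copy.deepcopy(), and found that even this does not give the result I wanted. After more Googling, the answer I found is that you have to deep-copy the index and the values separately and then construct a new DataFrame from them.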
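
That workaround might look roughly like the sketch below. `deep_copy_frame` is a hypothetical helper name of my own, not a pandas function. It also assumes it is acceptable to go through `df.values`, which may change column dtypes when the frame mixes types, so treat it as an illustration and not a general-purpose solution.

```python
import copy

import pandas as pd


def deep_copy_frame(df):
    """Hypothetical helper: rebuild a DataFrame from deep-copied parts."""
    # tolist() yields nested Python lists that still reference the original
    # cell objects; deepcopy then duplicates those objects recursively.
    values = copy.deepcopy(df.values.tolist())
    index = copy.deepcopy(df.index.tolist())
    columns = copy.deepcopy(df.columns.tolist())
    return pd.DataFrame(values, index=index, columns=columns)


a = pd.DataFrame([[[1]]])
b = deep_copy_frame(a)

b.loc[0, 0][0] = 2
print(b.loc[0, 0])  # [2]
print(a.loc[0, 0])  # [1]
```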



This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
