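python - How to split a pandas DataFrame into blocks by a string?

Tags: python list python-3.x pandas

I have a pandas DataFrame that was built by appending a series of lists. It consists mainly of strings containing a delimiter ('\n'), as shown below: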

   content

0   American Regent/Luitpold (Reverified 10/26/2016)\nCompany Contact Information:\n800-645-1706\n\nPresentation Availability and Estimated Shortage Duration Related Information Shortage Reason (per FDASIA)\n2 mL single-dose vial, package of 10 (NDC 00517-2502-10) Available for NDC 00517-2502-10. Demand increase for the drug
1   Amphastar Pharmaceuticals, Inc./IMS (Reverified 08/18/2016)\nCompany Contact Information:\n800-423-4136\n\nPresentation Availability and Estimated Shortage Duration Related Information Shortage Reason (per FDASIA)\nCalcium Chloride Inj. USP, 10%, 10mL Luer-Jet Prefilled Syringe, (NDC 0548-3304-00), new (NDC 76329-3304-1) Product available Demand increase for the drug\nHospira, Inc. (Reverified 10/21/2016)
2   American Regent/Luitpold (Reverified 10/26/2016)\nCompany Contact Information:\n800-645-1706\n\nPresentation Availability and Estimated Shortage Duration Related Information Shortage Reason (per FDASIA)\n10%, 50 mL vial; Calcium (0.465 mEq/mL), Preservative Free (NDC 0517-3950-25) Unavailable for NDC 00517-3950-25. No product available for release. No plan to manufacture. American Regent is currently not releasing Calcium Gluconate 50 mL vial (NDC 00517-3950-25). Other\n10%, 100 mL vial; Calcium (0.465 mEq/mL), Preservative Free (NDC 0517-3900-25) Unavailable for NDC 00517-3900-25. American Regent is currently not releasing Calcium Gluconate 100 mL vial (NDC 0517-3900-25). Other\nFresenius Kabi USA, LLC (Revised 11/01/2016)
 .......
n   Apotex Corp. (Revised 05/16/2016)\nCompany Contact Information:\n800-706-5575\n\nPresentation\n1gm; (25 Vials) (NDC 60505-0749-5)\n1gm; (25 Vials)(NDC 60505-6093-5)\n10 gm; (10 Vials) (NDC 60505-0769-0)\n10 gm; (10 Vials) (NDC 60505-6094-0)\nNote:\nAvailable\nB. Braun Medical Inc. (Revised 05/16/2016)\n\n\nBaxter Healthcare (Revised 05/16/2016)\n\n\nFresenius Kabi USA, LLC (Revised 05/16/2016)\n\n\nHospira, Inc. (Revised 05/16/2016)\n\n\nSagent Pharmaceuticals (Revised 05/16/2016)\n\n\nSandoz (Revised 05/16/2016)\n\n\nWest-Ward Pharmaceuticals (Revised 05/16/2016)\n\n\nWG Critical Care (Revised 05/16/2016)
n-1 Apotex Corp. (Reverified 10/26/2016)\nCompany Contact Information:\n800-706-5575\n\nPresentation Availability and Estimated Shortage Duration Related Information Shortage Reason (per FDASIA)\nCefepime for Injection, USP 1 gm (10 Vials) (NDC 60505-6030-4) On backorder. Shortage duration is unknown. Requirements relating to complying with current good manufacturing practices (cGMP).\nCefepime for Injection, USP 2 gm (10 Vials)(NDC 60505-6031-4) On backorder. Shortage duration is unknown. Requirements relating to complying with current good manufacturing practices (cGMP).\nCefepime for injection, USP 1 gm (10 Vials) (NDC 60605-0834-04) On backorder. Shortage duration is unknown. Requirements relating to complying with current good manufacturing practices (cGMP).\nCefepime for injection, USP 2 gm (10 Vials) (NDC 60505-0681-4) On backorder. Shortage duration is unknown. Requirements relating to complying with current good manufacturing practices (cGMP).\nCefepime for injection, USP 1 gm (1 Vial) (NDC 60505-0834-00) On backorder. Shortage duration is unknown. Requirements relating to complying with current good manufacturing practices (cGMP).\nCefepime for injection, USP 2 gm (10 Vials) (NDC 60505-0681-0) On backorder. Shortage duration is unknown. Requirements relating to complying with current good manufacturing practices (cGMP).\nB. Braun Medical Inc. (New 07/22/2015)\n\n\nBaxter Healthcare (Reverified 10/25/2016)\n\n\nFresenius Kabi USA, LLC (Revised 11/01/2016)\n\n\nHospira, Inc. (Reverified 10/21/2016)\n\n\nSagent Pharmaceuticals (Revised 08/29/2016)\n\n\nWG Critical Care (Revised 06/08/2016)

How can I split the DataFrame's content on the newline \n into multiple columns, like this:

   col1              col2        col3        col4
0  Shire US Inc. (Reverified 07/01/2016)   and so  on.... 
1  Hospira, Inc. (Reverified 10/21/2016)   and so  on....  
2  Mission Pharmacal (Reverified 01/21/2015)   and so  on....  
....
n  Mission Pharmacal (Reverified 01/21/2015)   and so  on....  

I tried:

df['col'] = df['content'].str.split('\n', expand = True)

Obviously, this fails with "Wrong number of items passed 45, placement implies 1". And because I build the DataFrame like this:

df = pd.DataFrame(lis, columns = ['content'])

I cannot use sep.

Best answer

This is similar to an earlier question.

import pandas as pd

# Sample frame: every row holds one string with embedded newlines
df = pd.DataFrame(
    ['The quick brown\n fox jumps \nover the \n lazy dog'] * 4,
    columns = ['data'],
)

# Split each string on '\n'; the pieces come back in reversed order, one per column
foo = lambda x: pd.Series(list(reversed(x.split('\n'))))
rev = df['data'].apply(foo)
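
If the reversed order is not needed, the split can probably be done without apply. The following is only a minimal sketch, assuming the strings contain real newline characters; the sample strings and the col1, col2, ... names are made up for illustration. The point it shows is that str.split with expand=True returns a whole DataFrame, so the result has to be joined back (or kept as its own frame) rather than assigned to a single column.

```python
import pandas as pd

# Hypothetical sample data: two strings with a different number of "\n" parts.
df = pd.DataFrame(
    {"content": ["Company A\nContact:\n800-000-0000", "Company B\nContact"]}
)

# expand=True returns a DataFrame with one column per piece; rows with fewer
# pieces are padded with missing values.
parts = df["content"].str.split("\n", expand=True)

# Give the new columns readable names (col1, col2, ...).
parts.columns = [f"col{i + 1}" for i in range(parts.shape[1])]

# Join the pieces onto the original frame instead of assigning them to a
# single column, which is what triggers the "Wrong number of items" error.
result = df.join(parts)
print(result)
```

Rows that split into fewer pieces than the longest row end up with missing values in the trailing columns.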

Edit: after the discussion here, I updated the code to load multiple files into a single DataFrame:

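import pandas as pd

allFiles_df = None
for it, currFile in enumerate(files):

    # Read every line of the file into a single 'data' column
    df = pd.read_csv(currFile, sep = '\n', header = None)
    df.columns = ['data']

    # Split on the literal text \n (backslash + n), reversing the pieces
    splitFunc = lambda x: pd.Series([i for i in reversed(x.split('\\n'))])

    df = df['data'].apply(splitFunc)
    # One row per piece, keeping the original row number as 'level_0'
    df = df.stack().to_frame().reset_index().drop(['level_1'],axis = 1)
    # Discard pieces of two characters or fewer
    df = df[df[0].str.len() >2]
    df['fileNo'] = it

    # Append this file's rows to the combined frame
    allFiles_df = pd.concat([allFiles_df, df])

allFiles_df.columns = ['rowNo','text','fileNo']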

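Key caveat: the '\n' is literal text in the raw data, so it is read into Python as '\\n' (a backslash followed by n). The C parser behind read_csv only supports single-character separators (longer ones are treated as regular expressions by the Python engine), which is likely why you ran into problems.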
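
The difference between a real newline and the two-character text backslash + n can be checked with plain Python strings. This is just an illustrative sketch with made-up sample values, not part of the original answer:

```python
real_newline = "first\nsecond"    # contains one actual newline character
literal_text = r"first\nsecond"   # contains a backslash followed by "n"

print(len(real_newline), len(literal_text))

# Splitting on "\n" only works when the string holds a real newline.
print(real_newline.split("\n"))

# The literal text is left in one piece by "\n" ...
print(literal_text.split("\n"))

# ... and has to be split on "\\n" (backslash + n) instead.
print(literal_text.split("\\n"))
```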

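This outputs, for each string, the file and the row number it came from. It assumes that the files variable contains a list of file names including their paths.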
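
In case read_csv with sep = '\n' is not accepted by your pandas version, the same kind of table can be built by reading the files with plain Python. The sketch below is an assumption-laden variant, not the answer's method: load_files is a hypothetical helper name, the pieces are kept in their original order rather than reversed, blank lines still count toward the row number, and the demo file content is invented.

```python
import os
import tempfile

import pandas as pd


def load_files(files):
    """Hypothetical helper: one output row per piece of each input line."""
    records = []
    for file_no, path in enumerate(files):
        with open(path, encoding="utf-8") as handle:
            for row_no, line in enumerate(handle):
                # Drop the real line ending, then split on the two-character
                # text backslash + "n" that is embedded in the data.
                for piece in line.rstrip("\n").split("\\n"):
                    if len(piece) > 2:  # skip pieces of two characters or fewer
                        records.append((row_no, piece, file_no))
    return pd.DataFrame(records, columns=["rowNo", "text", "fileNo"])


# Demo with a throwaway file containing literal backslash-n sequences.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("Company A\\nContact Information:\\n\\n800-000-0000\n")
        handle.write("Company B\\nNote:\n")
    all_files_df = load_files([path])

print(all_files_df)
```

Collecting plain tuples and building the DataFrame once at the end avoids calling pd.concat inside the loop.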

Regarding "python - How to split a pandas DataFrame into blocks by a string?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/40414385/