python - Remove html and uuencode from a .txt file

I want to process a text file that contains a lot of html and uuencoded characters.

For example, see the .txt file at the following link:

https://www.sec.gov/Archives/edgar/data/1522690/000121390016011794/0001213900-16-011794.txt

I am using the following code:

from bs4 import BeautifulSoup

def strip_non_ascii(string):
    ''' Returns the string without non ASCII characters'''
    stripped = (c for c in string if 0 < ord(c) < 127)
    return ''.join(stripped)

with open("C:/EDGAR/forms_to_process/10K/20160322_10-K_edgar_data_1522690_0001213900-16-011794_1.txt") as f:
    lines = f.readlines()
    with open("PROCESSED.txt", 'w', encoding='utf-8') as f1:
        i=1
        for line in lines:
            soup = BeautifulSoup(line, "lxml")
            print(i, "Initial line: ", line)
            print(i, "Soup get text line: ", soup.get_text())
            bs_line = soup.get_text()
            ascii_line = strip_non_ascii(bs_line)
            print(i, "Ascii line: ", ascii_line)
            f1.write(ascii_line)
            i=i+1

This reduces the file from 8.5 MB to 2.5 MB, but it still contains many elements I don't need, for example:

</tr>
<tr style="vertical-align: bottom; background-color: #cceeff;">
<td
style="padding: 0px 0px 0px 10pt; text-indent: -10pt;"><font style="font-family: 'times new roman', times, serif;">&#160;</font></td>
<td><font style="font-family: 'times new roman', times, serif;">&#160;</font></td>
<td style="text-align: left;"><font style="font-family: 'times new roman', times, serif;">&#160;</font></td>
<td style="text-align: right;"><font style="font-family: 'times new roman', times, serif;">&#160;</font></td>
<td style="text-align: left;"><font style="font-family: 'times new roman', times, serif;">&#160;</font></td>
<td><font style="font-family: 'times new roman', times, serif;">&#160;</font></td>
<td style="text-align: left;"><font style="font-family: 'times new roman', times, serif;">&#160;</font></td>
<td style="text-align: right;"><font style="font-family: 'times new roman', times, serif;">&#160;</font></td>
<td style="text-align: left;"><font style="font-family: 'times new roman', times, serif;">&#160;</font></td>

and

EXCEL
86
Financial_Report.xlsx
IDEA: XBRL DOCUMENT

begin 644 Financial_Report.xlsx
M4$L#!!0    (  J%=D@6'2-4(0(  $8I   3    6T-O;G1E;G1?5'EP97-=
M+GAM;,W:2V[;,! &X*L8VA86S62DZ(U
MW")I8^#?6):'G!EII&_EJV\/@=+BX(8QK:LNY_"!L=1TY&RJ?:"Q1#8^.IO+
M:=RR8)N=W1(3JY5AC1\SC7F9IQS5]=67/<78M[3X> Q,N=>5#6'H&YM[/[+]
MV)YD7?K-IF^H]M31U1=D.=\L- Z5S]8^2I\@UM[-V07U3X\=[5D89Y3>KZ\%CJTZ%D2>6W=56B
MZ5D53C?^K;/>34,+X_:W'=/Y/U[+R4WM[KY[OWO-QX2FJVJI7898%L;M([5?MWH+T\0ZDC_M D56@2*K0)%5H,@J4&05*+(*%%D%BJP215:)(JM$D56BR"I19)4HLDH4626*
MK!)%5HDBJT*15:'(JE!D52BR*A19%8JL"D56A2*K0I%5HMBJP:15:-(JM&D56CR*I19-4HLAH460V*K 9%5H,BJT&1U:#(:E!D-2BR&A19

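Is there a way to remove this content and keep only the relevant text information contained in the text file?

EDIT: From the link I provided, an example of the text I want to keep is: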

<P STYLE="font: 10pt/normal Times New Roman,serif; margin: 0; text-align: justify">The table above indicates the current yields
to maturity (YTM) for the senior bonds of selected life insurance carriers with durations, on average, that our similar to our
life insurance portfolio.&nbsp; The average yield to maturity of these bonds was 3.02% which, we believe, reflects in part the
financial market&rsquo;s judgement that credit risk is low with regard to these carriers&rsquo; financial obligations. It should
be noted that the obligations of life insurance carriers to pay life insurance policy benefits is senior in rank to any other obligation.&nbsp;
This &ldquo;super senior&rdquo; priority is not reflected in the yield to maturity in the table and, if considered, would result
in a lower yield to maturity all else being equal. As such, as long as the respective premium payments have been made, it is highly
likely that the owner of the insurance policy will collect the insurance policy benefit upon the mortality of the insured.</P>

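I.e. I want to remove all html tags and uuencoded binary files and keep only the text.

EDIT 2:

Gerrit's answer below definitely comes very close to what I want to achieve, at least for the .txt file under consideration. However, it still leaves the following section at the end of the file: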

Actuarial Pricing Systems, LP Model Actuarial
Pricing Systems, LP  33(Q7.U=JG''<]S7/R,ZG4BCJ0V3TKG/'&I;?V=X:N-K;9;C]RA^O4_EFG
M:==/<^*KESYJ(^GP2")\_*26SQV-%M9T2^ER$N(E=_96.&'X J:]=&,<=*\L\2V
MWB>ZTU9M7LH$M[;D-$5!4'CL3QTKH]*\07E[I&CVUFT;(NYU=))9E+!!&!G@$
M9)RO?O6N(3G%3OKL88:2IRE"SMNCL=X]*7--R3Z'/J"VI>Y=WC\L,/)7RB<9
MSR?>CD8O:)['4%@!D\#UKE_'K!O",S @CS(R#G_:%5AKUS=23VDLUO<03V<[
MI)#"Z!2HZ MPXP>HJ'Q!@?#*UQ_SRM^G_ :TI1:J1OW1E6FG3E;LS)70=)?X
M>KJDR>7>>4S"7>?F8,0!CH<]*W_AU<3/X==)22D<[)%GLN G1LGELK%Y,Q;NN>.3R>^>
MU;5IIIPO=W,*$&I1G;:RM]YUV[C.* V:YBPU'4KQX[33Q:0I;6L#R"56;<77(
M5<'@ #KS4"ZI=P2R0V,5LDL^JR6Y+[B.$SN//7CH./I7+RL[>=;G79J.;?Y;
M>7]_:=OUQQ7+2>([Z&W6;*>2TAG%Z]I)<%28U"KNW;_N9M#%]J
M6U6(=SLC*@("<'!YY S^-)Q:!33T/-/#;Z,NHW2>)(B97; >7.U7R=V[N#[F
MO3=$TO3]+LW73&+6\SF4'?O'(QP?3BN?U:#PGX@L)+\7=M%-LW"='VOTXW+W
M^A%8O@W4;^QT'5)X4\R"V>.0HP) &?W@'OMYKLJIU(.:NMM'M\CBHVI3Y&D;]
M]5^IZAFDWO7[&]EL8U>TAFCA$PC:0J-NYWV@Y8#*C ]Z;%>ZC=Z_I
MC07]M);2V;2OY<;;'PR@D#/7GCTYZURM1VNNZ[=)IKJM@HU'>D8*M^Z*Y.X\\\ \<N>*ZN*020QOO1M
MR@[D/!]Q[5+BUN4I)G@##YC]:3%.;[Q^M)7T9\M<3%:WAO\ Y#D'T;^1K*K4
M\.?\AR#Z-_(UE7_AR-B23R64-RK3D%]RNW0D\9Z=33+?1-'MM9?5HK>X%TQ9B2C[
M06ZD#'^/\ 9%.TO6[IY[:QDMY9B$03W&[.
M&9-^>@&.0*J\M7W8NRTI8_P"BQ(%V,N\*#N'S @L,@_A4
M=QX@U18YW2TAAV6;3[)F.X,'*YZ=.,]J/>[A[BOH6;K1-)NWN#(EZ([EM\T2
M&14=O[Q4=^E33Z=837+W*-?V\LBA9&MS;(GF <#=CJ??K5.;Q#=V=U/$8/M,K
M2X2)"<*!$C, 0.>3QFGR^)+M))!'IZLB-(H+SX/[M0S9&..#^='O#O THH;.
M&^>\1+CSWB6%BR.:J0Z1I4.MR:NL5R;Q\Y9E<@9&#

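This appears to be a uuencoded binary section. Any idea how to get rid of it?

Best answer

Rather than filtering out the unwanted text, I would use soup to select the text you actually need. If that text is contained in <p> tags, then: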

from bs4 import BeautifulSoup
from bs4 import SoupStrainer

only_p_tags = SoupStrainer("p")

soup = BeautifulSoup(open("C:/EDGAR/forms_to_process/10K/20160322_10-K_edgar_data_1522690_0001213900-16-011794_1.txt"), "html.parser", parse_only=only_p_tags)

for p in soup:
    print(p.get_text())
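
For the uuencoded residue mentioned in EDIT 2, one option is to strip the encoded data from the raw text before it reaches BeautifulSoup. The following is only a minimal sketch, not a tested solution for every EDGAR filing. It assumes that each embedded binary starts with a `begin <mode> <filename>` line and finishes with a line containing only `end`, and that any leftover full-length data lines consist of `M` followed by 60 characters with no lowercase letters. The names `strip_uuencoded_blocks` and `drop_uuencoded_lines` are hypothetical helpers introduced here for illustration.

```python
import re

# Assumption: each embedded binary starts with a "begin <mode> <filename>"
# line and finishes with a line containing only "end".
UU_BLOCK = re.compile(
    r"^begin [0-7]{3,4} .*?^end[ \t]*$\n?",
    re.MULTILINE | re.DOTALL,
)

# Heuristic: a full uuencoded line is "M" plus 60 characters in the range
# 0x20-0x60 (no lowercase letters), so ordinary prose rarely matches.
UU_LINE = re.compile(r"^M[\x20-\x60]{60}$")


def strip_uuencoded_blocks(text):
    """Remove complete begin...end uuencoded blocks from text."""
    return UU_BLOCK.sub("", text)


def drop_uuencoded_lines(text):
    """Drop stray lines that look like full-length uuencoded data."""
    kept = [
        line for line in text.splitlines(keepends=True)
        if not UU_LINE.match(line.rstrip("\r\n"))
    ]
    return "".join(kept)


# Small synthetic sample standing in for the contents of a filing.
sample = "\n".join([
    "Intro text",
    "begin 644 Financial_Report.xlsx",
    "M" + "A" * 60,
    "end",
    "Closing text",
]) + "\n"

cleaned = drop_uuencoded_lines(strip_uuencoded_blocks(sample))
print(cleaned)
```

The cleaned string can then be passed to `BeautifulSoup(...)` in place of the open file object. The line-level filter is a heuristic: shorter final lines of a block and damaged lines will not match it, so some residue may still need manual inspection.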

Regarding python - Remove html and uuencode from a .txt file, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/40891386/
