python - How do I get a substring from a text file in Python?

Tags: python string text

I have a bunch of tweets in plain-text form, as shown below. I only want to extract the text part of each one.

Sample data from the file:

Fri Nov 13 20:27:16 +0000 2015 4181010297 rt     we're treating one of you lads to this d'struct denim shirt! simply follow & rt to enter
Fri Nov 13 20:27:16 +0000 2015 2891325562 this album is wonderful, i'm so proud of you, i loved this album, it really is the best.    -273
Fri Nov 13 20:27:19 +0000 2015 2347993701 international break is garbage smh. it's boring and your players get injured
Fri Nov 13 20:27:20 +0000 2015 3168571911 get weather updates from the weather channel. 15:27:19
Fri Nov 13 20:27:20 +0000 2015 2495101558 woah what happened to twitter this update is horrible
Fri Nov 13 20:27:19 +0000 2015 229544082 i've completed the daily quest in paradise island 2!
Fri Nov 13 20:27:17 +0000 2015 309233999 new post: henderson memorial public library
Fri Nov 13 20:27:21 +0000 2015 291806707 who's going to  next week?
Fri Nov 13 20:27:19 +0000 2015 3031745900 why so blue?    @ golden bee

This is my attempt at the preprocessing stage:

import glob

for filename in glob.glob('*.txt'):
    # outfile is opened for appending, but nothing is written to it yet
    with open("plain text - preprocesshurricane.txt",'a') as outfile ,open(filename, 'r') as infile:
        for tweet in infile.readlines():
            temp=tweet.split(' ')
            text=""
            for i in temp:
                x=str(i)
                # keep only tokens made up entirely of letters
                if x.isalpha() :
                    text += x + ' '
            print(text)

Output:

Fri Nov rt treating one of you lads to this denim simply follow rt to 
Fri Nov this album is so proud of i loved this it really is the 
Fri Nov international break is garbage boring and your players get 
Fri Nov get weather updates from the weather 
Fri Nov woah what happened to twitter this update is 
Fri Nov completed the daily quest in paradise island 
Fri Nov new henderson memorial public 
Fri Nov going to next 
Fri Nov why so golden 

This output is not what I want, because:

1. It drops any numbers/digits in the text part of the tweet.
2. Every line starts with "Fri Nov".

Could you suggest a better way to achieve this? I'm not very familiar with regular expressions, but I think we could use something like re.search(r'2015(magic to remove tweetID)/w*',tweet)

Best answer

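In this case you can avoid regular expressions. The lines you've shown are consistent in the number of spaces that come before the tweet text, so just use split() with a maximum of 7 splits and take the last piece: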

>>> data = """
   lines with tweets here
"""
>>> for line in data.splitlines():
...     print(line.split(" ", 7)[-1])
... 
rt     we're treating one of you lads to this d'struct denim shirt! simply follow & rt to enter
this album is wonderful, i'm so proud of you, i loved this album, it really is the best.    -273
international break is garbage smh. it's boring and your players get injured
get weather updates from the weather channel. 15:27:19
woah what happened to twitter this update is horrible
i've completed the daily quest in paradise island 2!
new post: henderson memorial public library
who's going to  next week?
why so blue?    @ golden bee
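
To apply the same idea to lines read from a file, here is a minimal sketch. It assumes every line starts with exactly seven space-separated fields (six for the timestamp, one for the numeric tweet ID), as in the sample data. The names tweet_text, tweet_text_re and iter_tweet_texts are hypothetical helpers, not part of the original answer. A regex variant is included only because the question asked about one; it encodes the same assumption about the prefix.

```python
import re

# Assumption: 6 timestamp fields + 1 numeric tweet ID, each followed by one space.
PREFIX_FIELDS = 7

# Regex form of the same assumption: six non-space fields, a numeric ID, then the text.
PREFIX_RE = re.compile(r"^(?:\S+ ){6}\d+ (.*)$")


def tweet_text(line):
    """Return the text part of one line, or None if the line has too few fields."""
    parts = line.rstrip("\n").split(" ", PREFIX_FIELDS)
    return parts[-1] if len(parts) == PREFIX_FIELDS + 1 else None


def tweet_text_re(line):
    """Regex alternative to tweet_text(); returns None when the prefix does not match."""
    match = PREFIX_RE.match(line.rstrip("\n"))
    return match.group(1) if match else None


def iter_tweet_texts(lines):
    """Yield the text part of each well-formed line; skip blank or short lines."""
    for line in lines:
        text = tweet_text(line)
        if text is not None:
            yield text
```

Because a file object iterates over its lines, iter_tweet_texts(infile) can be used directly inside the `with open(...)` block from the question, and each yielded string can be written to the output file followed by a newline.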

Regarding "python - How do I get a substring from a text file in Python?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/36850322/
