python - Return rows whose string content has no word longer than a given maximum length, while keeping or filtering out words with specific content

This is my data frame.

Input

        qid                     question_stemmed    target  question_length total_words
443216  56da6b6875d686b48fde    mathfracint1x53x5 tantanboxedint1x01x2 sumvarp...   1   589 40
163583  1ffca149bd0a19cd714c    mathoverbracesumvartheta8infty vecfracsumkappa...   1   498 31
522266  663c7523d48f5ee66a3e    httpgooglecom check out the content of the www..    0   449 66
522379  756678d3d48f5ee66a3e    mark had a great day he plans to go fishing with.   0   310 23

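I am using the following logic to return only those records from df where the Question_text column:

has no word longer than 15 characters (note: word length, not string length) (using negation)
has no word containing a numeric value while the condition above holds (using negation)
still keeps words containing http or www (while the 2 conditions above continue to hold)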

df = df[(~df['question_stemmed'].str.len() > 15) & (~df['question_stemmed'].str.contains(r'[0-9]') ) & (df.question_stemmed.str.match('^[^\http]*$'))]

This raises an error: error: bad escape \h at position 3
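For reference, the error appears to come from the regex engine rather than from pandas, so it can be reproduced with the standard re module alone. The following is a minimal sketch, assuming a recent Python 3; the helper name compile_error is made up for illustration.

```python
import re

def compile_error(pattern):
    """Return the regex compile error message, or None if the pattern compiles."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None

# The pattern from the question: \h is not a valid escape in Python's re module.
print(compile_error(r'^[^\http]*$'))

# Without the backslash the pattern compiles, but [^http] is a character
# class: it excludes the single characters h, t and p, not the word "http".
print(compile_error(r'^[^http]*$'))
```

Note that removing the backslash only silences the error. The character class still does not test for the substring http, so it would not express the third condition.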

Expected output

        qid                     question_stemmed     target    question_length  total_words
522266  663c7523d48f5ee66a3e    httpgooglecom check out the content of the www..    0   449 66
522379  756678d3d48f5ee66a3e    mark had a great day he plans to go fishing with.   0   310 23

Also, I would like to know whether the logic above can satisfy all 3 conditions. Any help is appreciated.

Best answer

I suggest using

df = df[~df['question_stemmed'].str.contains(r'(?<!\S)(?!\S*(?:http|www\.))\S{15}')]

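See the regex demo.

Details

(?<!\S) - a whitespace character or the start of the string must immediately precede the current position
(?!\S*(?:http|www\.)) - a negative lookahead: the match fails if, immediately to the right of the current position, there are 0 or more non-whitespace characters followed by http or www.
\S{15} - fifteen non-whitespace characters.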
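The behaviour described above can be checked without a data frame, since str.contains performs a regex search on each row by default. Below is a minimal sketch using only the re module; the sample strings and the helper name keep_row are made up for illustration and are not the real data.

```python
import re

# Pattern from the answer above, unchanged.
LONG_WORD = re.compile(r'(?<!\S)(?!\S*(?:http|www\.))\S{15}')

def keep_row(text):
    """Return True when the pattern finds no disallowed long word in text."""
    return LONG_WORD.search(text) is None

# Hypothetical sample strings (not the real data frame).
samples = [
    "mathfracint1x53x5 tantanboxedint1x01x2",       # 17-character word
    "mark had a great day he plans to go fishing",  # only short words
    "httpgooglecomsearchresults check out this",    # long, but contains http
    "see www.examplecomsomelongpath today",         # long, but contains www.
    "see wwwexamplecomsomelongpath today",          # long, www without a dot
    "abcdefghijklmno is fifteen characters",        # word of exactly 15 characters
]

kept = [s for s in samples if keep_row(s)]
for s in kept:
    print(s)
```

Two points are worth noting under these assumptions. First, \S{15} also flags a word of exactly 15 characters; if only words strictly longer than 15 should be rejected, \S{16} would be the quantifier to try. Second, www\. requires a literal dot, so in text where punctuation has been stripped a long word starting with www is not exempted.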

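Regarding "python - Return rows whose string content has no word longer than a given maximum length, while keeping or filtering out words with specific content", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/62918301/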
