我正在编写一个脚本来处理我拥有的一些 csv 数据,以便将其呈现在 d3.js map 上,我在这句话之后立即提供完整的脚本,但只是为了上下文,这是我的重要部分问题如下。
# The purpose of this script is the refinement of the job data attained from the
# JSI as it is rendered by the `csv generator` contributed by Luis for purposes
# of presentation on the dashboard map.
import csv
# The number of columns
num_headers = 9
# Remove invalid characters from records
def url_escaper(data):
for line in data:
yield line.replace('&','&')
# Be sure to configure input & output files
with open("input.txt", 'r') as file_in, open("try_this_output.txt", 'w') as file_out:
csv_in = csv.reader( url_escaper( file_in ) )
csv_out = csv.writer(file_out)
# Get rid of rows that have the wrong number of columns
# and rows that have only whitespace for a columnar value
for i, row in enumerate(csv_in, start=1):
for e in row:
if "|" in e:
e = e.split(";")[0]
if not [e for e in row if not e.strip()]:
if len(row) == num_headers:
csv_out.writerow(row)
else:
print "line %d is malformed" % i
有一些列值的结构如下:
linux|devops|firewall|vmware|.net-framework|.net|paas
我想使用以下代码片段将它们切片:
e.split("|")[0]
这样我就只剩下“|
”之前的文本第一部分,即在上面的示例 linux
中。
我需要将处理后的数据写入输出文件。
我知道该代码片段有效,但我不知道如何将其融入我的管道中。
这是我关心的部分:
for i, row in enumerate(csv_in, start=1):
for e in row:
if "|" in e:
e = e.split("|")[0]
if not [e for e in row if not e.strip()]:
if len(row) == num_headers:
csv_out.writerow(row)
else:
print "line %d is malformed" % i
特别是这个:
for e in row:
if "|" in e:
e = e.split(";")[0]
很明显,这不是实现这一目标的方法。
输入数据的示例如下:
http://www.edsa-project.eu/adzuna/eyJhbGciOiJIUzI1NiJ9.eyJpIjoiMjk1MzYyMDY2IiwicyI6ImhiTUN6MTdUUkVPdWl5NUI2bDdwQXcifQ.A6MlT_WKpLx763hZe44X4pQ0KOMHYuITosCIwuMbPxM,Technical Account Manager, Technical Delivery Manager - Cloud,Peopleworks,Farnborough,51.293999,-0.754624,United Kingdom,linux|devops|firewall|vmware|.net-framework|.net|paas,1
http://www.edsa-project.eu/adzuna/eyJhbGciOiJIUzI1NiJ9.eyJpIjoiMzA5MzE5OTExIiwicyI6Ik9feVBUT1VNVC0tcUZ2N1FvRWNVU1EifQ.C8ZAc9RAFSMdyaCaIIMB51-jGS01Az29VY8Dblc7QM4,Management Consultant - Utilities Smart Energy,Capgemini Consulting,Lee,51.451818,-0.02806,United Kingdom,leadership|database|project management|design|scada,1
理想的输出是
http://www.edsa-project.eu/adzuna/eyJhbGciOiJIUzI1NiJ9.eyJpIjoiMjk1MzYyMDY2IiwicyI6ImhiTUN6MTdUUkVPdWl5NUI2bDdwQXcifQ.A6MlT_WKpLx763hZe44X4pQ0KOMHYuITosCIwuMbPxM,Technical Account Manager, Technical Delivery Manager - Cloud,Peopleworks,Farnborough,51.293999,-0.754624,United Kingdom,linux,1
http://www.edsa-project.eu/adzuna/eyJhbGciOiJIUzI1NiJ9.eyJpIjoiMzA5MzE5OTExIiwicyI6Ik9feVBUT1VNVC0tcUZ2N1FvRWNVU1EifQ.C8ZAc9RAFSMdyaCaIIMB51-jGS01Az29VY8Dblc7QM4,Management Consultant - Utilities Smart Energy,Capgemini Consulting,Lee,51.451818,-0.02806,United Kingdom,leadership,1
最佳答案
您可以使用正则表达式来解决问题。
我抓取了您的输入数据并将其放入文件“input.txt”中
http://www.edsa-project.eu/adzuna/eyJhbGciOiJIUzI1NiJ9.eyJpIjoiMjk1MzYyMDY2IiwicyI6ImhiTUN6MTdUUkVPdWl5NUI2bDdwQXcifQ.A6MlT_WKpLx763hZe44X4pQ0KOMHYuITosCIwuMbPxM,Technical Account Manager, Technical Delivery Manager - Cloud,Peopleworks,Farnborough,51.293999,-0.754624,United Kingdom,linux|devops|firewall|vmware|.net-framework|.net|paas,1
http://www.edsa-project.eu/adzuna/eyJhbGciOiJIUzI1NiJ9.eyJpIjoiMzA5MzE5OTExIiwicyI6Ik9feVBUT1VNVC0tcUZ2N1FvRWNVU1EifQ.C8ZAc9RAFSMdyaCaIIMB51-jGS01Az29VY8Dblc7QM4,Management Consultant - Utilities Smart Energy,Capgemini Consulting,Lee,51.451818,-0.02806,United Kingdom,leadership|database|project management|design|scada,1
您的预期结果:
http://www.edsa-project.eu/adzuna/eyJhbGciOiJIUzI1NiJ9.eyJpIjoiMjk1MzYyMDY2IiwicyI6ImhiTUN6MTdUUkVPdWl5NUI2bDdwQXcifQ.A6MlT_WKpLx763hZe44X4pQ0KOMHYuITosCIwuMbPxM,Technical Account Manager, Technical Delivery Manager - Cloud,Peopleworks,Farnborough,51.293999,-0.754624,United Kingdom,linux,1
http://www.edsa-project.eu/adzuna/eyJhbGciOiJIUzI1NiJ9.eyJpIjoiMzA5MzE5OTExIiwicyI6Ik9feVBUT1VNVC0tcUZ2N1FvRWNVU1EifQ.C8ZAc9RAFSMdyaCaIIMB51-jGS01Az29VY8Dblc7QM4,Management Consultant - Utilities Smart Energy,Capgemini Consulting,Lee,51.451818,-0.02806,United Kingdom,leadership,1
这是我使用的代码:
import re
src = open('input.txt')
output = open('output.txt', 'w')
pat = r'([^,]*\|[^,]*)'
for line in src:
the_search = re.search(pat, line) # Search the line for something containing '|'
if the_search:
the_group = the_search.group(0) # Grab the capture group
value = the_group.split("|")[0] # Grab the first item after splitting based on '|'
new_line = re.sub(pat, value, line) # Use re.sub to replace that entire pattern with the value
output.write(new_line)
src.close()
output.close()
您需要定制此解决方案以适合您想要执行的操作,但正则表达式应该可以工作。
关于python - 对 csv 的各个列值实现 python split,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/35204131/