python - Adding a date-loaded field when uploading a CSV to BigQuery

Tags: python google-bigquery

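I am using Python. Is there a way to add an extra field while loading a CSV file into BigQuery? I would like to add a date_loaded field that contains the current date.

This is the Google sample code I have been using: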

# from google.cloud import bigquery
# client = bigquery.Client()
# dataset_id = 'my_dataset'

dataset_ref = client.dataset(dataset_id)
job_config = bigquery.LoadJobConfig()
job_config.schema = [
    bigquery.SchemaField('name', 'STRING'),
    bigquery.SchemaField('post_abbr', 'STRING')
]
job_config.skip_leading_rows = 1
# The source format defaults to CSV, so the line below is optional.
job_config.source_format = bigquery.SourceFormat.CSV
uri = 'gs://cloud-samples-data/bigquery/us-states/us-states.csv'
load_job = client.load_table_from_uri(
    uri,
    dataset_ref.table('us_states'),
    job_config=job_config)  # API request
print('Starting job {}'.format(load_job.job_id))

load_job.result()  # Waits for table load to complete.
print('Job finished.')

destination_table = client.get_table(dataset_ref.table('us_states'))
print('Loaded {} rows.'.format(destination_table.num_rows))

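Best answer

You can solve this by adapting this Python example: open and read the original CSV file on your local machine, add a column, and append a timestamp to the end of every row so that the new column is never empty. This link explains how to get a timestamp with a custom date and time in Python.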
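As a minimal sketch of the timestamp step, the function below formats a datetime as 'YYYY-MM-DD HH:MM:SS', one of the text forms BigQuery accepts for a DATETIME column. The name bigquery_datetime_string is hypothetical and chosen only for this illustration.

```python
import datetime


def bigquery_datetime_string(moment=None):
    """Format a datetime as 'YYYY-MM-DD HH:MM:SS' (hypothetical helper).

    If no datetime is given, the current local time is used.
    """
    if moment is None:
        moment = datetime.datetime.now()
    return moment.strftime('%Y-%m-%d %H:%M:%S')


# A custom date and time instead of "now"
custom = datetime.datetime(2018, 11, 21, 9, 30, 0)
print(bigquery_datetime_string(custom))  # prints 2018-11-21 09:30:00
```

Formatting the value explicitly drops the microseconds that str(datetime.datetime.now()) would otherwise include.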

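Then write the resulting data to an output file and upload that file to Google Cloud Storage. Here you can find information on how to run external commands from a Python file.

I hope this helps.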
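One possible way to run the external upload command is sketched below with subprocess.run and an argument list, which avoids building a shell string by hand. The helper names gsutil_cp_command and run_command and the bucket name my-bucket are assumptions made for this example. On Windows, gsutil may be installed as a .cmd wrapper, in which case the list form may still need shell=True.

```python
import subprocess
import sys


def gsutil_cp_command(local_path, bucket_id):
    """Build the argument list for `gsutil cp <local_path> gs://<bucket_id>`."""
    return ['gsutil', 'cp', local_path, 'gs://' + bucket_id]


def run_command(args):
    """Run an external command; raise CalledProcessError on a non-zero exit."""
    completed = subprocess.run(args, check=True)
    return completed.returncode


# Harmless demonstration that does not need gsutil installed
run_command([sys.executable, '-c', 'print("external command ran")'])

# With the Cloud SDK installed, the upload would be (hypothetical bucket name):
# run_command(gsutil_cp_command('out_file_us-states.csv', 'my-bucket'))
```

Passing check=True means a failed upload stops the script before the load job starts, rather than loading a stale or missing file.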

#Import the dependencies
import csv,datetime,subprocess
from google.cloud import bigquery

#Replace the values for variables with the appropriate ones
#Name of the input csv file
csv_in_name = 'us-states.csv'
#Name of the output csv file to avoid messing up the original
csv_out_name = 'out_file_us-states.csv'
#Name of the NEW COLUMN NAME to be added
new_col_name = 'date_loaded'
#Type of the new column
col_type = 'DATETIME'
#Name of your bucket
bucket_id = 'YOUR BUCKET ID'
#Your dataset name
ds_id = 'YOUR DATASET ID'
#The destination table name
destination_table_name = 'TABLE NAME'


# Read the input CSV and write a copy that has the extra column.
# newline='' lets the csv module handle line endings itself.
with open(csv_in_name, 'r', newline='') as r_csvfile:
    with open(csv_out_name, 'w', newline='') as w_csvfile:

        dict_reader = csv.DictReader(r_csvfile, delimiter=',')
        # Append the new column name to the existing header
        fieldnames = dict_reader.fieldnames + [new_col_name]
        writer_csv = csv.DictWriter(w_csvfile, fieldnames, delimiter=',')
        writer_csv.writeheader()

        for row in dict_reader:
            # Put the timestamp after the last comma so that the column is not empty
            row[new_col_name] = datetime.datetime.now()
            writer_csv.writerow(row)

#Copy the file to your Google Storage bucket
subprocess.call('gsutil cp ' + csv_out_name + ' gs://' + bucket_id , shell=True)


client = bigquery.Client()

dataset_ref = client.dataset(ds_id)
job_config = bigquery.LoadJobConfig()
#Add a new column to the schema!
job_config.schema = [
    bigquery.SchemaField('name', 'STRING'),
    bigquery.SchemaField('post_abbr', 'STRING'),
    bigquery.SchemaField(new_col_name, col_type)
]
job_config.skip_leading_rows = 1
# The source format defaults to CSV, so the line below is optional.
job_config.source_format = bigquery.SourceFormat.CSV
#Address string of the output csv file
uri = 'gs://' + bucket_id + '/' + csv_out_name
load_job = client.load_table_from_uri(uri,dataset_ref.table(destination_table_name),job_config=job_config)  # API request
print('Starting job {}'.format(load_job.job_id))

load_job.result()  # Waits for table load to complete.
print('Job finished.')

destination_table = client.get_table(dataset_ref.table(destination_table_name))
print('Loaded {} rows.'.format(destination_table.num_rows))
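The CSV-rewriting step can also be factored into a small function that works on any pair of file-like objects, which makes it easy to try without touching Cloud Storage or BigQuery. This is only a sketch under stated assumptions: the name add_loaded_column is hypothetical, and unlike the script above it stamps every row with the same timestamp, taken once per run.

```python
import csv
import datetime
import io


def add_loaded_column(src, dst, column_name='date_loaded', loaded_at=None):
    """Copy CSV rows from src to dst, appending one constant timestamp column.

    Returns the number of data rows written (the header is not counted).
    """
    if loaded_at is None:
        loaded_at = datetime.datetime.now()
    stamp = loaded_at.strftime('%Y-%m-%d %H:%M:%S')

    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator='\n')

    header = next(reader, None)
    if header is None:
        return 0  # empty input: nothing to write
    writer.writerow(header + [column_name])

    count = 0
    for row in reader:
        writer.writerow(row + [stamp])
        count += 1
    return count


# In-memory demonstration with two rows from the sample file's layout
source = io.StringIO('name,post_abbr\nAlabama,AL\nAlaska,AK\n')
target = io.StringIO()
rows = add_loaded_column(
    source, target, loaded_at=datetime.datetime(2018, 11, 21, 9, 30, 0))
print(rows)  # prints 2
print(target.getvalue())
```

With real files, src and dst would be opened with newline='' as in the script above. Using a single timestamp for the whole file means all rows of one load share the same date_loaded value, which makes it simple to filter by load later.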

Regarding python - adding a date-loaded field when uploading a CSV to BigQuery, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/53412407/
