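Perform a select transformation on the customer table by creating a new table. The new target table should have only three columns: c_custkey (unchanged), c_address, and c_city.
For the c_address column, shorten it to 5 characters.
For c_city, insert a space and a # before the trailing digit to mark the number (e.g. UNITED KI2 => UNITED KI #2 or INDONESIA4 => INDONESIA #4).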
create table customer (
c_custkey int,
c_name varchar(25),
c_address varchar(25),
c_city varchar(10),
c_nation varchar(15),
c_region varchar(12),
c_phone varchar(15),
c_mktsegment varchar(10)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
create table customer_ty (
c_custkey int,
c_address STRING,
c_city STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
;
Customer table data
1|Customer#000000001|j5JsirBM9P|MOROCCO 0|MOROCCO|AFRICA|25-989-741-2988|BUILDING|
2|Customer#000000002|487LW1dovn6Q4dMVym|JORDAN 1|JORDAN|MIDDLE EAST|23-768-687-3665|AUTOMOBILE|
3|Customer#000000003|fkRGN8n|ARGENTINA7|ARGENTINA|AMERICA|11-719-748-3364|AUTOMOBILE|
4|Customer#000000004|4u58h f|EGYPT 4|EGYPT|MIDDLE EAST|14-128-190-5944|MACHINERY|
5|Customer#000000005|hwBtxkoBF qSW4KrI|CANADA 5|CANADA|AMERICA|13-750-942-6364|HOUSEHOLD|
6|Customer#000000006| g1s,pzDenUEBW3O,2 pxu|SAUDI ARA2|SAUDI ARABIA|MIDDLE EAST|30-114-968-4951|AUTOMOBILE|
7|Customer#000000007|8OkMVLQ1dK6Mbu6WG9|CHINA 0|CHINA|ASIA|28-190-982-9759|AUTOMOBILE|
8|Customer#000000008|j,pZ,Qp,qtFEo0r0c 92qo|PERU 6|PERU|AMERICA|27-147-574-9335|BUILDING|
9|Customer#000000009|vgIql8H6zoyuLMFN|INDIA 6|INDIA|ASIA|18-338-906-3675|FURNITURE|
10|Customer#000000010|Vf mQ6Ug9Ucf5OKGYq fs|ETHIOPIA 9|ETHIOPIA|AFRICA|15-741-346-9870|HOUSEHOLD|
Best answer
The c_city transformation is easy to do with a regexp; see this Hive example:
select regexp_replace('INDONESIA4', '(.*?)(\\d+)$','$1 #$2');
Result
INDONESIA #4
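It is not clear how the address should be shortened. Please clarify the rule, and I or someone else may be able to help with that part as well.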
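Since the question asks for Python, here is a minimal sketch of both column transformations applied to one pipe-delimited row. It assumes that "shorten to 5 characters" simply means keeping the first five characters, which the question does not confirm. The function names mark_city, transform_row, and transform_lines are hypothetical, chosen only for illustration.

```python
import re

# Same idea as the Hive regexp_replace pattern: capture everything before
# the trailing digits, then the trailing digits themselves.
_CITY_DIGITS = re.compile(r"(.*?)(\d+)$")


def mark_city(c_city):
    """Insert ' #' before the trailing digits; other values pass through."""
    return _CITY_DIGITS.sub(r"\1 #\2", c_city, count=1)


def transform_row(line, address_len=5):
    """Turn one pipe-delimited customer row into a customer_ty row."""
    fields = line.rstrip("\n").split("|")
    # Column order follows the customer DDL: key, name, address, city, ...
    c_custkey, c_address, c_city = fields[0], fields[2], fields[3]
    # Assumption: "shorten" means keep the first address_len characters.
    short_address = c_address[:address_len]
    return "|".join([c_custkey, short_address, mark_city(c_city)])


def transform_lines(lines):
    """Yield transformed rows, skipping blank input lines."""
    for line in lines:
        if line.strip():
            yield transform_row(line)
```

For example, transform_row applied to the third sample row returns "3|fkRGN|ARGENTINA #7". In a streaming-style script, transform_lines could be fed sys.stdin and each result printed; that wiring is left out here because it depends on how the job is run.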
Regarding "python - Python code for transforming a table in Hadoop", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/61736000/