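regex - Hive external table definition for a text file with multi-line records

Tags: regex hadoop hive

I need to parse this file into a Hive table. It is a movie review dataset from Amazon. I am having trouble building a regular expression that parses the .txt file and creates a table with the correct column types.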

Contents of the .txt file:

product/productId: B0001G6PZC
review/userId: A3F3THLLZXURQN
review/profileName: A. Y
review/helpfulness: 3/3
review/score: 4.0
review/time: 1199664000
review/summary: Good story, Good action. Good Drama. Good Movie
review/text: When I first heard of this movie, I didn't think it would be that great, so I never bothered to go see it in theaters. Later on, I ended up downloading the movie, and didn't think much of it.<br /><br />But now after watching the movie on BD, I think that the movie is quite outstanding. Its got a good story behind it, with some level of historical basis behind it with Samurai becoming phased out into Japan's modernization.<br /><br />It does a good job in immersing you into the conflicts that warriors must endure... and yet, find peace with the way of the Samurai as they are a warrior race and not savages.<br /><br />4/5 stars.

product/productId: B0001G6PZC
review/userId: A3J78KAIPW6KAH
review/profileName: Joan Paolo De Bastos "conde_almasy"
review/helpfulness: 3/3
review/score: 4.0
review/time: 1198540800
review/summary: Good Movie. Wonderful Visuals. A Great Way to SHOW OFF you Hi-Def System
review/text: Last Samurai is no masterpiece<br /><br />but technically it is<br /><br />the visuals, the sound effects, the music.<br /><br />If you want to show off to your friends what a great hi-def system you got, purchase this movie.<br /><br />If you want a classic, but lord of the rings or gone with the wind instead.

product/productId: B0001G6PZC
review/userId: A3F3B6HY9RJI04
review/profileName: James Duckett
review/helpfulness: 3/3
review/score: 5.0
review/time: 1192060800
review/summary: Great Movie, Fantastic HD Quality
review/text: After picking up my HD DVD player I've had troubles watching regular DVD movies.  I had heard some good things about this movie but couldn't pass it up once it was in high definition.<br /><br />The story is pretty good.  This is the story of Captain Algren who has been sent to Japan in the late 1800's in order to help them modernize the Japanese army as they go from fighting with swords and arrows to machine guns and cannons.<br /><br />After the "modern" Japanese army prematurely attacks the Samurai and lose horribly, Captain Algren is taken captive by the Samurai and introduced to their way of life and refusal to lay down the sword in the name of compliance.  In time, Captain Algren finds himself wanting to become one of the Samurai and learning more of their way of life.<br /><br />The story is pretty good but what raises this up to the level of being outstanding is the high definition quality of the movie.  It was fantastic, especially seeing the colorful Japanese landscape in all of its magnificence.<br /><br />If you like Tom Cruise action movies, this is one to pick up especially in high definition (whether it be Blu-Ray or HD DVD).  The violence can be extremely graphic (hey, this is war) so if you are sensitive to that you may want to look for something else.  Otherwise, the pacing of the movie is pretty good.  It isn't an all out gore-fest... there is action and then it breaks and lets you relax and catch up a little bit and then goes back to action and so on and so forth.

Here is my SQL:

CREATE EXTERNAL TABLE movies(id string, uId string, profileName string, helpfulness string, score float, time int, summary string, text string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' 
WITH serdeproperties( "input.regex" = "[ ].*", "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s")
location '/user/hduser/moviesTest';

But Hive does not parse it correctly, and SELECT * FROM movies gives me this result:

NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL

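Can anyone tell me what I am doing wrong?

Best answer

This can be done fairly easily with a Hive UDF.

Suppose your data is in a table named temp, with a single column named line: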
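
For context on the NULLs in the question: as far as I know, the contrib RegexSerDe is handed one line of the file at a time, needs input.regex to match that entire line, and fills one column per capturing group; when a line does not match, the row comes back as NULLs. Under that assumption, the pattern "[ ].*" fails on both counts, since no line starts with a space and the pattern has no groups. A small plain-Java check of both points (RegexCheck and its methods are made-up names for this sketch, and the two-group pattern is only an illustration, not a fix for multi-line records):

```java
import java.util.regex.Pattern;

// Illustrative helper only: RegexCheck is not part of Hive.
final class RegexCheck {
    /** True when the regex matches the entire line, which is what Matcher.matches() requires. */
    static boolean matchesWholeLine(String regex, String line) {
        return Pattern.compile(regex).matcher(line).matches();
    }

    /** Number of capturing groups in the regex (assumption: one is needed per table column). */
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        String line = "review/score: 4.0";

        // The question's pattern: requires a leading space and captures nothing.
        System.out.println(matchesWholeLine("[ ].*", line)); // false
        System.out.println(groupCount("[ ].*"));             // 0

        // A "key: value" pattern with two groups does match a single line.
        String keyValue = "([^:]+): (.*)";
        System.out.println(matchesWholeLine(keyValue, line)); // true
        System.out.println(groupCount(keyValue));             // 2
    }
}
```

Even a per-line pattern like the second one only ever sees one "key: value" line at a time, which is why the lines of a record have to be stitched together first.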

create table temp(line String);
load data local inpath 'review.txt' into table temp;
select line from temp;

product/productId: B0001G6PZC
review/userId: A3F3THLLZXURQN
review/profileName: A. Y
review/helpfulness: 3/3
review/score: 4.0
review/time: 1199664000
review/summary: Good story, Good action. Good Drama. Good Movie
review/text: When I first heard of this movie, I didn't think it would be that great, so I never bothered to go see it in theaters. Later on, I ended up downloading the movie, and didn't think much of it.<br /><br />But now after watching the movie on BD, I think that the movie is quite outstanding. Its got a good story behind it, with some level of historical basis behind it with Samurai becoming phased out into Japan's modernization.<br /><br />It does a good job in immersing you into the conflicts that warriors must endure... and yet, find peace with the way of the Samurai as they are a warrior race and not savages.<br /><br />4/5 stars.

product/productId: B0001G6PZC
review/userId: A3J78KAIPW6KAH
review/profileName: Joan Paolo De Bastos "conde_almasy"
review/helpfulness: 3/3
review/score: 4.0
review/time: 1198540800
............

............

Create a Hive UDF in Java. Here is the source:

package HiveUDF;

import org.apache.hadoop.hive.ql.exec.UDF;

/**
 * Buffers the lines of one review and, when the "review/text:" line arrives,
 * returns the whole record as a single tab-separated string. Every other
 * line returns null.
 *
 * Note: this keeps state between calls, so it assumes the lines of a record
 * reach the same UDF instance in file order.
 */
public class ReviewDataUdf extends UDF {
    // Field labels in the order they appear within a record.
    private static final String[] LABELS = {
        "product/productId:", "review/userId:", "review/profileName:",
        "review/helpfulness:", "review/score:", "review/time:",
        "review/summary:", "review/text:"
    };

    // Lines seen so far for the current record.
    private final StringBuilder buffer = new StringBuilder();

    public String evaluate(String t) {
        if (t == null) {
            return null;
        }
        buffer.append(' ').append(t);
        if (!t.contains("review/text:")) {
            return null; // record not complete yet
        }
        String s = buffer.toString();
        buffer.setLength(0); // start a fresh record on the next call

        StringBuilder out = new StringBuilder();
        for (int i = 0; i < LABELS.length; i++) {
            if (i > 0) {
                out.append('\t');
            }
            String next = (i + 1 < LABELS.length) ? LABELS[i + 1] : null;
            out.append(extract(s, LABELS[i], next));
        }
        return out.toString();
    }

    // Returns the trimmed text between label and nextLabel (or the end of the
    // record when nextLabel is null); "N/A" if the label is missing, "" if the
    // following label cannot be found.
    private static String extract(String s, String label, String nextLabel) {
        int start = s.indexOf(label);
        if (start < 0) {
            return "N/A";
        }
        start += label.length(); // use the real label length, not a hard-coded offset
        int end = (nextLabel == null) ? s.length() : s.indexOf(nextLabel, start);
        if (end < 0) {
            return "";
        }
        return s.substring(start, end).trim();
    }
}
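
The UDF cannot be run without Hive on the classpath, so here is a minimal plain-Java sketch of the same buffer-then-slice idea that can be tried on a few lines locally. ReviewAssembler and its method names are hypothetical, and the review text in main is shortened to the last sentence of the first sample review:

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch; ReviewAssembler is a made-up name, not a Hive class.
final class ReviewAssembler {
    private static final String[] LABELS = {
        "product/productId:", "review/userId:", "review/profileName:",
        "review/helpfulness:", "review/score:", "review/time:",
        "review/summary:", "review/text:"
    };

    private final StringBuilder buffer = new StringBuilder();

    /** Buffers a line; returns the tab-separated record once the review/text line arrives, else null. */
    String accept(String line) {
        if (line == null) {
            return null;
        }
        buffer.append(' ').append(line);
        if (!line.contains("review/text:")) {
            return null;
        }
        String record = buffer.toString();
        buffer.setLength(0);
        List<String> fields = new ArrayList<>();
        for (int i = 0; i < LABELS.length; i++) {
            String next = (i + 1 < LABELS.length) ? LABELS[i + 1] : null;
            fields.add(slice(record, LABELS[i], next));
        }
        return String.join("\t", fields);
    }

    /** Trimmed text between label and nextLabel; "N/A" if label is absent, "" if nextLabel is absent. */
    private static String slice(String record, String label, String nextLabel) {
        int start = record.indexOf(label);
        if (start < 0) {
            return "N/A";
        }
        start += label.length();
        int end = (nextLabel == null) ? record.length() : record.indexOf(nextLabel, start);
        return (end < 0) ? "" : record.substring(start, end).trim();
    }

    public static void main(String[] args) {
        ReviewAssembler assembler = new ReviewAssembler();
        String[] lines = {
            "product/productId: B0001G6PZC",
            "review/userId: A3F3THLLZXURQN",
            "review/profileName: A. Y",
            "review/helpfulness: 3/3",
            "review/score: 4.0",
            "review/time: 1199664000",
            "review/summary: Good story, Good action. Good Drama. Good Movie",
            "review/text: 4/5 stars." // shortened for the example
        };
        for (String line : lines) {
            String record = assembler.accept(line);
            if (record != null) {
                System.out.println(record);
            }
        }
    }
}
```

Taking the offset from label.length() means "review/time: 1199664000" yields the full timestamp; a hard-coded offset that is too large would silently drop the leading characters of a value.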

Export ReviewDataUdf.jar, register it in Hive, and create the function.

hive> ADD JAR /home/Kishore/ReviewDataUdf.jar;

hive> create temporary FUNCTION structReview as 'HiveUDF.ReviewDataUdf';

Use the structReview function to get the structured data.

CREATE TABLE AmazonReview AS
SELECT split(review, "\t")[0] AS productId,
       split(review, "\t")[1] AS userId,
       split(review, "\t")[2] AS profileName,
       split(review, "\t")[3] AS helpfulness,
       split(review, "\t")[4] AS score,
       split(review, "\t")[5] AS time,
       split(review, "\t")[6] AS summary,
       split(review, "\t")[7] AS text
FROM (
  SELECT structReview(line) AS review FROM temp
) b
WHERE review IS NOT NULL;

The data is now in structured form in the AmazonReview table:

select productId, userId, profileName from AmazonReview;
OK
B0001G6PZC   A3F3THLLZXURQN      A. Y 
B0001G6PZC   A3J78KAIPW6KAH      Joan Paolo De Bastos "conde_almasy" 
B0001G6PZC   A3F3B6HY9RJI04      James Duckett
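
The table above keeps every column as a string, while the question asked for score as a float and time as an integer. In Hive that would presumably be a CAST around the corresponding split(...) expression. To make the split-and-convert step concrete, here is a rough plain-Java sketch; TypedReview is a hypothetical class, its field names follow the question's column list, and the sample text is shortened:

```java
// Hypothetical value class for this sketch; not part of the answer's UDF.
final class TypedReview {
    final String productId;
    final String userId;
    final String profileName;
    final String helpfulness;
    final float score;
    final long time; // epoch seconds
    final String summary;
    final String text;

    private TypedReview(String[] f) {
        productId = f[0];
        userId = f[1];
        profileName = f[2];
        helpfulness = f[3];
        score = Float.parseFloat(f[4].trim());
        time = Long.parseLong(f[5].trim());
        summary = f[6];
        text = f[7];
    }

    /** Splits one tab-separated record into its eight fields and converts the numeric ones. */
    static TypedReview parse(String record) {
        // Limit -1 keeps trailing empty fields, so the count check stays meaningful.
        String[] f = record.split("\t", -1);
        if (f.length != 8) {
            throw new IllegalArgumentException("expected 8 fields, got " + f.length);
        }
        return new TypedReview(f);
    }

    public static void main(String[] args) {
        TypedReview r = parse("B0001G6PZC\tA3F3THLLZXURQN\tA. Y\t3/3\t4.0\t1199664000\t"
                + "Good story, Good action. Good Drama. Good Movie\t4/5 stars.");
        System.out.println(r.productId + " " + r.score + " " + r.time);
    }
}
```

A field that is "N/A" or empty would make the numeric parse throw here, so real data would need a decision on how to treat missing scores and timestamps.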

Source: this question on Stack Overflow: https://stackoverflow.com/questions/30387682/
