java - Loading a data file that uses 3 spaces as the delimiter with Spark's CSV reader in Java

Tags: java csv apache-spark spark-dataframe

I have a numeric data file that I am trying to read in. The data looks like this:

   1   6   4  12   5   5   3   4   1  67   3   2   1   2   1   0   0   1   0   0   1   0   0   1   1 
   2  48   2  60   1   3   2   2   1  22   3   1   1   1   1   0   0   1   0   0   1   0   0   1   2 

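The values are separated by 3 spaces. I want to load the file into a Spark DataFrame.

I am struggling to parse it: Spark seems to read each line as one big string.

I tried the following: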

Dataset<Row> df = spark.read().format("com.databricks.spark.csv")
            .option("header", "false")
            .option("delimter", "\t")
            .load(csvFile);
    df.show(5);

And also:

.option("delimter", "   ") // leads to java error that Delimter cant take more than one character

I also tried .option("sep", "\t") instead of "delimter".

Here is my full code:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CreditRiskML {

static SparkSession spark = SparkSession.builder()
        .appName("Credit Risk ML")
        .master("local[*]")
        .config("spark.sql.warehouse.dir", "E:/Exp/")
        .getOrCreate();

public static double parseDouble(String str){
    return Double.parseDouble(str);
}



public static void main(String[] args){

    String csvFile = "input\\credit.data";
    Dataset<Row> df = spark.read().format("com.databricks.spark.csv")
            .option("header", "false")
            .option("delimter", "\t")
            .option("sep", "\t")
            .load(csvFile);
    df.show(5);

    //create RDD of type Credit

    JavaRDD<Credit> creditRdd = df.toJavaRDD().map(new Function<Row, Credit>() {
        @Override
        public Credit call(Row r) throws Exception {
            return new Credit(parseDouble(r.getString(0)), parseDouble(r.getString(1)) - 1,
                    parseDouble(r.getString(2)), parseDouble(r.getString(3)), parseDouble(r.getString(4)),
                    parseDouble(r.getString(5)), parseDouble(r.getString(6)) - 1, parseDouble(r.getString(7)) - 1,
                    parseDouble(r.getString(8)), parseDouble(r.getString(9)) - 1, parseDouble(r.getString(10)) - 1,
                    parseDouble(r.getString(11)) - 1, parseDouble(r.getString(12)) - 1,
                    parseDouble(r.getString(13)), parseDouble(r.getString(14)) - 1,
                    parseDouble(r.getString(15)) - 1, parseDouble(r.getString(16)) - 1,
                    parseDouble(r.getString(17)) - 1, parseDouble(r.getString(18)) - 1,
                    parseDouble(r.getString(19)) - 1, parseDouble(r.getString(20)) - 1);
        }
    });

    //Create a dataset of type Row from the RDD of type Credit
    Dataset<Row> creditData = spark.sqlContext().createDataFrame(creditRdd, Credit.class);

    creditData.show(5);

 }
}

Error message:

java.lang.NumberFormatException: For input string: "1   6   4  12   5   5   3   4   1  67   3   2   1   2   1   0   0   1   0   0   1   0   0   1   1"

What is the best way to solve this? Any help is greatly appreciated.

P.S. Here is the Credit class:

public class Credit {
    private double creditability;
    private double balance;
    private double duration;
    private double history;
    private double purpose;
    private double amount;
    private double savings;
    private double employment;
    private double instPercent;
    private double sexMarried;
    private double guarantors;
    private double residenceDuration;
    private double assets;
    private double age;
    private double concCredit;
    private double apartment;
    private double credits;
    private double occupation;
    private double dependents;
    private double hasPhone;
    private double foreign;

    public Credit(double creditability, double balance, double duration, double history, double purpose, double amount,
                  double savings, double employment, double instPercent, double sexMarried, double guarantors,
                  double residenceDuration, double assets, double age, double concCredit, double apartment, double credits,
                  double occupation, double dependents, double hasPhone, double foreign) {
        super();
        this.creditability = creditability;
        this.balance = balance;
        this.duration = duration;
        this.history = history;
        this.purpose = purpose;
        this.amount = amount;
        this.savings = savings;
        this.employment = employment;
        this.instPercent = instPercent;
        this.sexMarried = sexMarried;
        this.guarantors = guarantors;
        this.residenceDuration = residenceDuration;
        this.assets = assets;
        this.age = age;
        this.concCredit = concCredit;
        this.apartment = apartment;
        this.credits = credits;
        this.occupation = occupation;
        this.dependents = dependents;
        this.hasPhone = hasPhone;
        this.foreign = foreign;
    }

    public double getCreditability() {
        return creditability;
    }

    public void setCreditability(double creditability) {
        this.creditability = creditability;
    }

    public double getBalance() {
        return balance;
    }

    public void setBalance(double balance) {
        this.balance = balance;
    }

    public double getDuration() {
        return duration;
    }

    public void setDuration(double duration) {
        this.duration = duration;
    }

    public double getHistory() {
        return history;
    }

    public void setHistory(double history) {
        this.history = history;
    }

    public double getPurpose() {
        return purpose;
    }

    public void setPurpose(double purpose) {
        this.purpose = purpose;
    }

    public double getAmount() {
        return amount;
    }

    public void setAmount(double amount) {
        this.amount = amount;
    }

    public double getSavings() {
        return savings;
    }

    public void setSavings(double savings) {
        this.savings = savings;
    }

    public double getEmployment() {
        return employment;
    }

    public void setEmployment(double employment) {
        this.employment = employment;
    }

    public double getInstPercent() {
        return instPercent;
    }

    public void setInstPercent(double instPercent) {
        this.instPercent = instPercent;
    }

    public double getSexMarried() {
        return sexMarried;
    }

    public void setSexMarried(double sexMarried) {
        this.sexMarried = sexMarried;
    }

    public double getGuarantors() {
        return guarantors;
    }

    public void setGuarantors(double guarantors) {
        this.guarantors = guarantors;
    }

    public double getResidenceDuration() {
        return residenceDuration;
    }

    public void setResidenceDuration(double residenceDuration) {
        this.residenceDuration = residenceDuration;
    }

    public double getAssets() {
        return assets;
    }

    public void setAssets(double assets) {
        this.assets = assets;
    }

    public double getAge() {
        return age;
    }

    public void setAge(double age) {
        this.age = age;
    }

    public double getConcCredit() {
        return concCredit;
    }

    public void setConcCredit(double concCredit) {
        this.concCredit = concCredit;
    }

    public double getApartment() {
        return apartment;
    }

    public void setApartment(double apartment) {
        this.apartment = apartment;
    }

    public double getCredits() {
        return credits;
    }

    public void setCredits(double credits) {
        this.credits = credits;
    }

    public double getOccupation() {
        return occupation;
    }

    public void setOccupation(double occupation) {
        this.occupation = occupation;
    }

    public double getDependents() {
        return dependents;
    }

    public void setDependents(double dependents) {
        this.dependents = dependents;
    }

    public double getHasPhone() {
        return hasPhone;
    }

    public void setHasPhone(double hasPhone) {
        this.hasPhone = hasPhone;
    }

    public double getForeign() {
        return foreign;
    }

    public void setForeign(double foreign) {
        this.foreign = foreign;
    }
}

Best answer

One way to solve this is to use java.util.Scanner. Because Scanner's default delimiter is whitespace, you do not need to specify a delimiter.

import java.util.Scanner;

String s = "1   0   2   0";
Scanner scanner = new Scanner(s);

// next() returns one token at a time, skipping any run of whitespace
while (scanner.hasNext()) {
  System.out.println(scanner.next());
}

The output will be:

1
0
2
0

This works regardless of how many spaces separate the values in the given string.
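
Applied to the data in the question, the same idea can turn one line into numeric fields. Note that the sample rows are right-aligned, so the gap before a two-digit value such as 12 is two spaces rather than three; splitting on runs of whitespace therefore seems more robust than a fixed three-space delimiter. The following is only a minimal sketch and not part of the original answer: the class name WhitespaceLineParser and the method parseLine are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class WhitespaceLineParser {

    // Hypothetical helper (not from the original post): splits one line on
    // runs of whitespace and parses every token as a double.
    public static double[] parseLine(String line) {
        List<Double> values = new ArrayList<>();
        // Scanner's default delimiter is whitespace, so leading, trailing,
        // and repeated spaces are all skipped.
        try (Scanner scanner = new Scanner(line)) {
            while (scanner.hasNext()) {
                values.add(Double.parseDouble(scanner.next()));
            }
        }
        // Copy into a primitive array for convenient indexed access.
        double[] result = new double[values.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = values.get(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // Shortened version of the first sample row from the question.
        String line = "   1   6   4  12   5   5   3   4   1  67 ";
        double[] fields = parseLine(line);
        System.out.println(fields.length + " fields, fourth = " + fields[3]);
    }
}
```

In a Spark job, one could presumably read the file as plain text lines (for example with spark.read().textFile(...)) and pass each line through a helper like this before constructing Credit objects. That wiring is an assumption and is not shown or tested here.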

This question, "java - Loading a data file that uses 3 spaces as the delimiter with Spark's CSV reader in Java", is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/43678852/
