mysql - Spark - Connecting to MySQL from Zeppelin on EMR

Tags: mysql scala apache-spark pyspark amazon-emr

I am trying to connect to a MySQL instance from an AWS EMR Zeppelin notebook. I have placed the MySQL connector at /usr/lib/spark/jars/mysql-connector-java-5.0.4-bin.jar and added it as an artifact in the Zeppelin interpreter settings.
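For reference, another common way to make the connector visible to the Spark interpreter is Zeppelin's dynamic dependency loader. This is a minimal sketch, assuming the %dep interpreter is enabled and the paragraph is run before the Spark interpreter starts; the JAR path is the one from above, and the Maven coordinate in the comment is only an illustration:

%dep
// Load the local connector JAR onto the Spark interpreter's classpath.
// A Maven coordinate, e.g. z.load("mysql:mysql-connector-java:5.1.38"),
// works as well if the cluster has outbound access to Maven Central.
z.load("/usr/lib/spark/jars/mysql-connector-java-5.0.4-bin.jar")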

Loading the driver class works fine:

Class.forName("com.mysql.jdbc.Driver")

res77: Class[_] = class com.mysql.jdbc.Driver

Then I tried the Scala snippets below.

Trial 1:

val jdbcDF = spark.read.format("jdbc").options(
  Map("url" ->  "jdbc:mysql://our-host.readm.co.nz:3306/testdb?user=testuser&password=****",
  "dbtable" -> "testdb.tablename",
  "driver" -> "com.mysql.jdbc.Driver",
  "partitionColumn" -> "id", "lowerBound" -> "1", "upperBound" -> "41514638", "numPartitions" -> "20"
  )).load()

This fails with the following error:

com.mysql.jdbc.CommunicationsException: Communications link failure due to underlying exception:
** BEGIN NESTED EXCEPTION **
java.net.SocketException
MESSAGE: java.net.ConnectException: Connection timed out (Connection timed out)
STACKTRACE:
java.net.SocketException: java.net.ConnectException: Connection timed out (Connection timed out)
    at com.mysql.jdbc.StandardSocketFactory.connect(StandardSocketFactory.java:156)
    at com.mysql.jdbc.MysqlIO.<init>(MysqlIO.java:276)
    at com.mysql.jdbc.Connection.createNewIO(Connection.java:2666)
    at com.mysql.jdbc.Connection.<init>(Connection.java:1531)
    at com.mysql.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:266)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:61)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:52)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:58)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:114)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:52)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:307)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
    at <init>(<console>:50)
    at <init>(<console>:55)
    at <init>(<console>:57)
    at <init>(<console>:59)
    at <init>(<console>:61)
    at <init>(<console>:63)
    at <init>(<console>:65)
    at <init>(<console>:67)
    at <init>(<console>:69)
    at <init>(<console>:71)
    at <init>(<console>:73)
    at <init>(<console>:75)
    at <init>(<console>:77)
    at <init>(<console>:79)
    at <init>(<console>:81)
    at <init>(<console>:83)
    at <init>(<console>:85)
    at <init>(<console>:87)
    at <init>(<console>:89)
    at <init>(<console>:91)
    at <init>(<console>:93)
    at .<init>(<console>:97)
    at .<clinit>(<console>)
    at .$print$lzycompute(<console>:7)
    at .$print(<console>:6)
    at $print(<console>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
    at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
    at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)
    at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)
    at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
    at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
    at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)
    at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569)
    at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
    at sun.reflect.GeneratedMethodAccessor369.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.zeppelin.spark.Utils.invokeMethod(Utils.java:38)
    at org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:1000)
    at org.apache.zeppelin.spark.SparkInterpreter.interpretInput(SparkInterpreter.java:1205)
    at org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:1172)
    at org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:1165)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:97)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:498)
    at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
    at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
** END NESTED EXCEPTION **
Last packet sent to the server was 1 ms ago.
  at com.mysql.jdbc.Connection.createNewIO(Connection.java:2741)
  at com.mysql.jdbc.Connection.<init>(Connection.java:1531)
  at com.mysql.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:266)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:61)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:52)
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:58)
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:114)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:52)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:307)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
  ... 58 elided

Then, Trial 2:

val partitionColumn = "id"
val dbUrl = "jdbc:mysql://our-host.readm.co.nz:3306"

val lowerBound = 1
val upperBound = 41514638

val jdbcDF = spark.read.format("jdbc")
    .option("url", dbUrl)
    .option("databaseName", "testdb")
    .option("dbtable", "tablename")
    .option("user", "username")
    .option("password", "*****")
    .option("driver","com.mysql.jdbc.Driver")
    .option("partitionColumn", partitionColumn)
    .option("lowerBound", lowerBound)
    .option("upperBound", upperBound)
    .option("numPartitions", 20)
    .load()

This fails with the following error:

java.lang.NullPointerException
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:72)
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:114)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:52)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:307)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
  ... 58 elided

Trial 3:

import java.util.Properties

val table = "tablename"
val url = "jdbc:mysql://our-host.readm.co.nz"
val prop = new Properties()
prop.put("driver","com.mysql.jdbc.Driver")
prop.put("user","username")
prop.put("password","****")
prop.put("databaseName","testdb")

val jdbcDF = spark.read.jdbc(url, table, prop)

This fails with the following error:

java.lang.NullPointerException
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:72)
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:114)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:52)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:307)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
  at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:193)
  ... 58 elided
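(Note on Trials 2 and 3: as far as I can tell, databaseName is not an option that Spark's JDBC reader interprets; the database is normally encoded in the JDBC URL itself, and in Trial 3 the URL also omits the port. A minimal sketch of what that would typically look like, reusing the placeholder host and table from above:

// Sketch only: put the database (testdb) in the URL path instead of a
// separate "databaseName" option, which Spark does not treat specially.
val dbUrl = "jdbc:mysql://our-host.readm.co.nz:3306/testdb"

val jdbcDF = spark.read.format("jdbc")
    .option("url", dbUrl)
    .option("dbtable", "tablename")
    .option("user", "username")
    .option("password", "*****")
    .option("driver", "com.mysql.jdbc.Driver")
    .load()
)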

I'm not sure what's wrong here. Can anyone help? Thanks.

Edit: This is what I get after trying the answer given by @geo:

com.mysql.jdbc.CommunicationsException: Communications link failure due to underlying exception:
** BEGIN NESTED EXCEPTION **
java.net.SocketException
MESSAGE: java.net.ConnectException: Connection timed out (Connection timed out)
STACKTRACE:
java.net.SocketException: java.net.ConnectException: Connection timed out (Connection timed out)
    at com.mysql.jdbc.StandardSocketFactory.connect(StandardSocketFactory.java:156)
    at com.mysql.jdbc.MysqlIO.<init>(MysqlIO.java:276)
    at com.mysql.jdbc.Connection.createNewIO(Connection.java:2666)
    at com.mysql.jdbc.Connection.<init>(Connection.java:1531)
    at com.mysql.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:266)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:61)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:52)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:58)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:114)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:52)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:307)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
    at <init>(<console>:40)
    at <init>(<console>:45)
    at <init>(<console>:47)
    at <init>(<console>:49)
    at <init>(<console>:51)
    at <init>(<console>:53)
    at <init>(<console>:55)
    at <init>(<console>:57)
    at <init>(<console>:59)
    at <init>(<console>:61)
    at <init>(<console>:63)
    at <init>(<console>:65)
    at <init>(<console>:67)
    at .<init>(<console>:71)
    at .<clinit>(<console>)
    at .$print$lzycompute(<console>:7)
    at .$print(<console>:6)
    at $print(<console>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
    at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
    at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)
    at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)
    at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
    at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
    at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)
    at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569)
    at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
    at sun.reflect.GeneratedMethodAccessor340.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.zeppelin.spark.Utils.invokeMethod(Utils.java:38)
    at org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:1000)
    at org.apache.zeppelin.spark.SparkInterpreter.interpretInput(SparkInterpreter.java:1205)
    at org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:1172)
    at org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:1165)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:97)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:498)
    at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
    at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
** END NESTED EXCEPTION **
Last packet sent to the server was 7 ms ago.
  at com.mysql.jdbc.Connection.createNewIO(Connection.java:2741)
  at com.mysql.jdbc.Connection.<init>(Connection.java:1531)
  at com.mysql.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:266)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:61)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:52)
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:58)
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:114)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:52)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:307)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
  ... 50 elided

Best answer

Assuming there is nothing wrong with the JDBC interpreter in Zeppelin, this works for me:

// DB connection
val options = Map(
    "url"       -> "jdbc:mysql://localhost:3306/dbname",
    "dbtable"   -> "dbtable",
    "user"      -> "user",
    "password"  -> "password"
)

val data = spark
    .read
    .format("jdbc")
    .options(options)
    .load
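
Once the relation resolves, a quick sanity check confirms the connection end to end, e.g.:

data.printSchema()    // table metadata is fetched over JDBC at load time
println(data.count()) // forces an actual read through the connection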

It goes without saying that a MySQL process must actually be listening on localhost:3306 (or wherever you host the database), and that firewalls/iptables must allow traffic to and from that process in both directions.
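Given that the stack traces above show a plain TCP connect timeout, a quick way to isolate the problem is to attempt a raw socket connection from the same Zeppelin notebook, taking Spark and JDBC out of the picture entirely. A minimal sketch, using the host and port from the question; if this times out too, the issue is network-level (on EMR, typically the MySQL host's security group not allowing inbound traffic from the cluster):

import java.net.{InetSocketAddress, Socket}

// Raw TCP reachability check from the Zeppelin/Spark driver node.
val socket = new Socket()
try {
  socket.connect(new InetSocketAddress("our-host.readm.co.nz", 3306), 5000) // 5 s timeout
  println("TCP connection to the MySQL host succeeded")
} finally {
  socket.close()
}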

Regarding "mysql - Spark - Connecting to MySQL from Zeppelin on EMR", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50574815/
