Persisting an updated DataFrame back to the data grid fails with InsightEdge

I am using the community edition of InsightEdge. From the insightedge-shell, I am running a Scala script that adds a new column to a DataFrame and then tries to overwrite the corresponding data in the data grid. The script does not fail with an error; instead, the DataFrame comes back empty, both after the save and after reloading it from the data grid.

Below is the code:

case class Data(COL1: Int, COL2: Int, COL3: Int)
val data = Seq(Seq(1, 2, 3), Seq(7, 8, 9), Seq(9, 2, 3), Seq(4, 2, 3), Seq(5, 6, 7))
val rdd = sc.parallelize(data, data.length).map(s => new Data(s(0), s(1), s(2)))

val threeColDF = sqlContext.createDataFrame(rdd)

import org.insightedge.spark.implicits.all._
import org.apache.spark.sql._
threeColDF.write.grid.mode(SaveMode.Overwrite).save("numseq")

val persisted3ColDF = sqlContext.read.grid.load("numseq")
persisted3ColDF.show()

val fourColDF = persisted3ColDF.withColumn("COL4", persisted3ColDF("COL3") + 1)
println("Printing modified dataframe before save to the data-grid")
fourColDF.show()
fourColDF.write.grid.mode(SaveMode.Overwrite).save("numseq")
println("Printing modified dataframe after save to the data-grid")
fourColDF.show()

val persisted4ColDF = sqlContext.read.grid.load("numseq")
println("Printing modified dataframe after reloading it from the data-grid")
persisted4ColDF.show()

Output

persisted3ColDF: org.apache.spark.sql.DataFrame = [COL1: int, COL2: int, COL3: int]
+----+----+----+
|COL1|COL2|COL3|
+----+----+----+
|   9|   2|   3|
|   7|   8|   9|
|   5|   6|   7|
|   1|   2|   3|
|   4|   2|   3|
+----+----+----+

fourColDF: org.apache.spark.sql.DataFrame = [COL1: int, COL2: int, COL3: int, COL4: int]
Printing modified dataframe before save to the data-grid
+----+----+----+----+
|COL1|COL2|COL3|COL4|
+----+----+----+----+
|   1|   2|   3|   4|
|   4|   2|   3|   4|
|   9|   2|   3|   4|
|   7|   8|   9|  10|
|   5|   6|   7|   8|
+----+----+----+----+

Printing modified dataframe after save to the data-grid
+----+----+----+----+
|COL1|COL2|COL3|COL4|
+----+----+----+----+
+----+----+----+----+

persisted4ColDF: org.apache.spark.sql.DataFrame = [COL1: int, COL2: int, COL3: int, COL4: int]
Printing modified dataframe after reloading it from the data-grid
+----+----+----+----+
|COL1|COL2|COL3|COL4|
+----+----+----+----+
+----+----+----+----+

asked 2017-01-24 03:25:47 -0600 by shahamit

1 Answer


Hello Amit,

The problem is at fourColDF.write.grid.mode(SaveMode.Overwrite).save("numseq").

The reason is that you are trying to overwrite the numseq object with a different definition: the original class has three columns, whereas the new DataFrame has four.

A Spark DataFrame is immutable, so threeColDF and fourColDF are two different objects with different definitions. In the data grid, however, numseq is mutable, and you are trying to override its existing three-column definition.
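To see the difference in definition before writing, the two schemas can be compared directly. This is only a minimal sketch: columnsOf, sameColumns and addedColumns are hypothetical helper names, not part of InsightEdge or Spark, and the sketch only inspects the Spark side (it says nothing about how the grid itself validates a type).

```scala
import org.apache.spark.sql.types.{DataType, StructType}

// Hypothetical helpers, not part of InsightEdge or Spark.
// Column names and types of a schema, in order (nullability is ignored).
def columnsOf(schema: StructType): Seq[(String, DataType)] =
  schema.fields.toSeq.map(f => (f.name, f.dataType))

// True when both schemas have the same columns with the same types.
def sameColumns(a: StructType, b: StructType): Boolean =
  columnsOf(a) == columnsOf(b)

// Names of the columns that exist in `after` but not in `before`.
def addedColumns(before: StructType, after: StructType): Seq[String] =
  after.fieldNames.toSeq.diff(before.fieldNames.toSeq)
```

With the DataFrames from the question, sameColumns(persisted3ColDF.schema, fourColDF.schema) and addedColumns(persisted3ColDF.schema, fourColDF.schema) would show whether the definition changed and which column was added.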

I suggest adding the following code to work around the problem. It ends up writing two different classes with different data type names (the four-column data is saved as numseq4 instead of numseq):

// A separate case class for the four-column data, so it gets its own type definition.
case class DataWith4Colums(COL1: Int, COL2: Int, COL3: Int, COL4: Int)
import sqlContext.implicits._ // needed for toDF(); harmless if the shell has already imported it
val fouColDFwithNewClass = fourColDF.map(data => new DataWith4Colums(
  data.getAs[Int]("COL1"), data.getAs[Int]("COL2"),
  data.getAs[Int]("COL3"), data.getAs[Int]("COL4"))).toDF()
println("Printing modified dataframe before save to the data-grid")
fouColDFwithNewClass.show()
// Save under a new name (numseq4) rather than overwriting numseq.
fouColDFwithNewClass.write.grid.mode(SaveMode.Overwrite).save("numseq4")
println("Printing modified dataframe after save to the data-grid")
fouColDFwithNewClass.show()

The output will be:

fouColDFwithNewClass: org.apache.spark.sql.DataFrame = [COL1: int, COL2: int, COL3: int, COL4: int]

Printing modified dataframe before save to the data-grid

+----+----+----+----+
|COL1|COL2|COL3|COL4|
+----+----+----+----+
|   7|   8|   9|  10|
|   9|   2|   3|   4|
|   1|   2|   3|   4|
|   4|   2|   3|   4|
|   5|   6|   7|   8|
+----+----+----+----+

Printing modified dataframe after save to the data-grid

+----+----+----+----+
|COL1|COL2|COL3|COL4|
+----+----+----+----+
|   7|   8|   9|  10|
|   9|   2|   3|   4|
|   1|   2|   3|   4|
|   4|   2|   3|   4|
|   5|   6|   7|   8|
+----+----+----+----+
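The output above shows the DataFrame after the save, but not a reload from the grid, which was the other failing step in the question. A minimal sketch for checking that, assuming the snippet above has already been run in the same insightedge-shell session against a running data grid (reloaded4ColDF is just an illustrative name):

```scala
import org.insightedge.spark.implicits.all._

// Reload the four-column data from the grid under its new name.
val reloaded4ColDF = sqlContext.read.grid.load("numseq4")

// Inspect the schema and contents that actually came back from the grid.
reloaded4ColDF.printSchema()
println("Printing four-column dataframe after reloading it from the data-grid")
reloaded4ColDF.show()
```

If the reloaded DataFrame is still empty here, the problem lies somewhere other than the type definition.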

Regards, Rajiv Shah

answered 2017-04-27 18:41:16 -0600 by rajiv

updated 2017-04-28 04:49:54 -0600 by shay hassidim
