scala - Is forcing an action on a Spark DataFrame required?


I have the code snippet below, and it is working fine.

So there are two actions in this code: updateDF.count() and mergedDF.save(). updateDF.count() is a dummy action; if I remove it, the job fails. Is it necessary to force such an action in the code? My feeling is that if I remove updateDF.count(), the first action encountered is mergedDF.save(), and computing mergedDF.save() then creates more intermediate DataFrames, which causes the job to fail. Please suggest a code change to make this better.

newDataDF.persist()

val historyDataDF = hiveContext.read.format("orc").load(stagingFullPath).persist()

val updateDF = historyDataDF
  .coalesce(5)
  .join(newDataDF, jobPrimaryKey)
  .select(historyDataDF.columns.map(historyDataDF(_)): _*)
  .persist()

println(updateDF.count())

val unchangedDF = historyDataDF.except(updateDF).persist()

val mergedDF = unchangedDF.unionAll(newDataDF).persist()

mergedDF.write
  .format("orc")
  .mode(org.apache.spark.sql.SaveMode.Overwrite)
  .save(stagingFullPath)
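For context, here is a minimal, self-contained sketch of the lazy-evaluation behaviour described above (it uses the Spark 2.x SparkSession API rather than the HiveContext from the job, and the tiny in-memory DataFrame is only illustrative): persist() merely marks a DataFrame for caching, and nothing is computed until the first action runs, so a dummy count() is what actually fills the cache before later operations reuse it.

import org.apache.spark.sql.SparkSession

object LazyPersistDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lazy-persist-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A small illustrative DataFrame (stands in for updateDF above).
    val df = Seq((1, "a"), (2, "b")).toDF("id", "value").persist()

    // persist() alone does nothing yet: it only marks df for caching.
    // count() is the first action, so this is where the rows are actually
    // computed and stored in the cache.
    df.count()

    // Later operations (except, unionAll/union, write) reuse the cached rows
    // instead of recomputing the lineage from the original source.
    df.show()

    spark.stop()
  }
}

In the job above the same mechanism applies: without the count(), the first action is the final save(), so the whole join/except/unionAll lineage is evaluated only at the point where stagingFullPath is being overwritten, even though historyDataDF still lazily refers to that same path.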

