scala - Is forcing an action on a Spark DataFrame required?
I have the code snippet below, and it is working fine. There are two actions in the code: updateDF.count() and mergedDF.save(). The updateDF.count() is a dummy action; if I remove it, the job fails. Is it necessary to force such an action in the code? My feeling is that if I remove updateDF.count(), the first action encountered is mergedDF.save(), and when computing mergedDF.save(), Spark creates more intermediary DataFrames, causing the job to fail. Please suggest a code change to make this better.
newDataDF.persist()
val historyDataDF = hiveContext.read.format("orc").load(stagingFullPath).persist()
val updateDF = historyDataDF.coalesce(5).join(newDataDF, jobPrimaryKey)
  .select(historyDataDF.columns.map(historyDataDF(_)): _*).persist()
println(updateDF.count())
val unchangedDF = historyDataDF.except(updateDF).persist()
val mergedDF = unchangedDF.unionAll(newDataDF).persist()
mergedDF.write.format("orc").mode(org.apache.spark.sql.SaveMode.Overwrite).save(stagingFullPath)
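One possible change, offered only as a minimal sketch and not as the accepted fix: assuming the failure comes from SaveMode.Overwrite deleting stagingFullPath while mergedDF is still lazily reading historyDataDF from that same path, the result can be written to a separate temporary location first and only then copied over the staging path. The tempFullPath name below is hypothetical; everything else reuses names from the snippet above.

import org.apache.spark.sql.SaveMode

// Hypothetical temporary location; any path the job is allowed to write to works.
val tempFullPath = stagingFullPath + "_tmp"

// Writing here is an action, so mergedDF (and the lazy read of historyDataDF)
// is fully computed before stagingFullPath is ever touched.
mergedDF.write.format("orc").mode(SaveMode.Overwrite).save(tempFullPath)

// Re-read the completed result and only then overwrite the original staging path.
val finalDF = hiveContext.read.format("orc").load(tempFullPath)
finalDF.write.format("orc").mode(SaveMode.Overwrite).save(stagingFullPath)

With this ordering, the dummy println(updateDF.count()) should no longer be needed to keep the job alive, since nothing derived from stagingFullPath is still pending when that path is overwritten.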