Understanding caching and persisting in Spark
Can someone please correct my understanding of persisting in Spark?
If I perform cache() on an RDD, its value is cached on the nodes where the RDD was initially computed. Meaning, if there is a cluster of 100 nodes, and the RDD is computed in partitions on the first and second nodes, then if I cache this RDD, Spark is going to cache its value only on the first or second worker nodes. So when this Spark application tries to use the RDD in later stages, the Spark driver has to get the value from the first/second nodes.
Am I correct?
(or)
Is the RDD value persisted in the driver memory and not on the nodes?
Change this:
then Spark is going to cache its value only on the first or second worker nodes.
to this:
then Spark is going to cache its value only on the first and second worker nodes.
...and yes, you are correct!
Spark tries to minimize memory usage (and we love it for that!), and it won't make unnecessary memory loads, since it evaluates every statement lazily, i.e. it won't do any actual work on a transformation; it will wait for an action to happen, which leaves Spark no choice but to do the actual work (read the file, communicate the data through the network, do the computation, collect the result back to the driver, for example...).
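As a minimal sketch of that laziness (the input path and the length threshold below are made up for illustration), nothing happens at the transformations; only the final action triggers the work:

import org.apache.spark.{SparkConf, SparkContext}

object LazyEvalSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lazy-eval-sketch"))

    // Transformations only: nothing has been read or computed yet.
    val lines    = sc.textFile("hdfs:///tmp/some-input.txt") // hypothetical path
    val lengths  = lines.map(_.length)
    val longOnes = lengths.filter(_ > 80)

    // The action is what finally makes Spark do the actual work:
    // read the file, ship tasks to the executors, collect the result.
    val count = longOnes.count()
    println("lines longer than 80 chars: " + count)

    sc.stop()
  }
}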
You see, we don't want to cache everything, only what we can (that is, what the memory capacity allows; yes, we can ask for more memory in the executors and/or the driver, but sometimes our cluster just doesn't have the resources, which is very common when we handle big data) and what makes sense to cache, i.e. an RDD that is going to be used again and again (so caching it will speed up the execution of our job).
That's why you want to unpersist() your RDD when you no longer need it...! :)
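Here is a small sketch of that cache-reuse-unpersist pattern (the input path and the parsing step are just placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheUnpersistSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cache-sketch"))

    // An RDD we expect to reuse several times.
    val parsed = sc.textFile("hdfs:///tmp/big-input.txt") // hypothetical path
      .map(_.split(","))
      .persist(StorageLevel.MEMORY_ONLY) // equivalent to cache() for RDDs

    // Each action reuses the partitions cached on the workers
    // instead of re-reading and re-parsing the file.
    println(parsed.count())
    println(parsed.filter(_.length > 3).count())

    // Free the executor memory once the RDD is no longer needed.
    parsed.unpersist()

    sc.stop()
  }
}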
Check the screenshot from one of my jobs, where I had requested 100 executors; the Executors tab displayed 101, i.e. 100 slaves/workers and 1 master/driver.
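For reference, the executor count is usually requested at submit time (e.g. with --num-executors when submitting to YARN) or through the SparkConf; a rough sketch of the latter, with made-up values and assuming dynamic allocation is disabled:

import org.apache.spark.SparkConf

// Illustrative only: the exact keys and values depend on your cluster manager and job.
val conf = new SparkConf()
  .setAppName("my-spark-job")
  .set("spark.executor.instances", "100") // ask for 100 executors
  .set("spark.executor.memory", "4g")
  .set("spark.executor.cores", "2")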