Understanding caching and persisting in Spark
Can someone please correct my understanding of persisting in Spark? If we have performed cache() on an RDD, its value is cached only on those nodes where the RDD was initially computed. Meaning, if there is a cluster of 100 nodes, and the RDD is computed in partitions on the first and second nodes, then if we cached this RDD, Spark is going to cache its value only on the first or second worker nodes. So when this Spark application tries to use this RDD in later stages, the Spark driver has to get the value from the first/second nodes. Am I correct? (Or) Is the RDD value persisted in driver memory and not on the nodes?

Change this: "then Spark is going to cache its value only on the first or second worker nodes" to this: "then Spark is going to cache its value only on the first and second worker nodes", and... yes, you are correct!

Spark tries to minimize memory usage (and we love it for that!), so it won't make any unnecessary memory loads: it evaluates every statement lazily, i.e. it won't do the actual work on a transformation; it will wait for an action to happen, which leaves Spark no choice but to do the actual work (read the file, communicate the data through the network, do the computation, collect the result back to the driver, for example...).
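Here is a minimal sketch of that behavior, assuming a local SparkSession and a hypothetical input file data.txt: the transformations and cache() itself do nothing, the first action materializes the RDD and stores each partition in memory on the executor that computed it, and the second action reuses those cached partitions instead of re-reading the file.

```scala
import org.apache.spark.sql.SparkSession

object CacheDemo {
  def main(args: Array[String]): Unit = {
    // Assumption: local mode for illustration; on a real cluster the master
    // would point at YARN/standalone and partitions would live on the workers.
    val spark = SparkSession.builder()
      .appName("CacheDemo")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Transformations are lazy: nothing has been read or computed yet.
    val lines   = sc.textFile("data.txt") // hypothetical input file
    val lengths = lines.map(_.length)

    // cache() only *marks* the RDD for caching (MEMORY_ONLY); still no work done.
    lengths.cache()

    // First action: Spark now reads the file, runs the map, and stores each
    // computed partition in memory on the executor that produced it,
    // not in driver memory.
    val total = lengths.reduce(_ + _)

    // Second action: reuses the cached partitions on those same executors;
    // the file is not re-read and the map is not re-run.
    val count = lengths.count()

    println(s"total chars = $total, lines = $count")
    spark.stop()
  }
}
```

In local[*] mode the "executors" are just threads in one JVM, but on a 100-node cluster the cached partitions would live on exactly the workers that computed them; the driver only pulls data back when you ask it to, e.g. with collect().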