scala - How does the DataFrame API depend on RDDs in Spark?
Some sources, such as this keynote (the Spark 2.0 talk by Matei Zaharia), mention that Spark DataFrames are built on top of RDDs. I have found some mentions of RDDs in the DataFrame class (in Spark 2.0 I'd have to look at Dataset); still, I have only a limited understanding of how these two APIs are bound together behind the scenes.
Can someone explain how DataFrames extend RDDs, if they do?
According to the Databricks article Deep Dive into Spark SQL's Catalyst Optimizer (see the section Using Catalyst in Spark SQL), RDDs are elements of the physical plan built by Catalyst. So, although you describe your queries in terms of DataFrames, in the end Spark operates on RDDs.
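To make that connection concrete, here is a minimal, self-contained sketch (the SparkSession setup and sample data are my own assumptions, not from the original question; assuming Spark 2.x) showing two ways to reach the RDD beneath a DataFrame: the public .rdd accessor, and queryExecution.toRdd, which exposes the RDD of internal rows that the Catalyst-generated physical plan actually runs:

import org.apache.spark.sql.SparkSession

object DataFrameOverRdd {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataframe-over-rdd")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data, just to have a DataFrame to inspect.
    val df = Seq(("a", 1), ("b", 2), ("c", 3)).toDF("key", "value")

    // .rdd converts the Dataset back into an RDD[Row]:
    // evidence that an RDD sits underneath the DataFrame API.
    println(df.rdd.getClass) // e.g. class org.apache.spark.rdd.MapPartitionsRDD

    // queryExecution.toRdd exposes the RDD[InternalRow] that the
    // physical plan produced by Catalyst actually executes.
    println(df.queryExecution.toRdd.toDebugString)

    spark.stop()
  }
}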
You can also view the physical plan of a query using the explain() method:
// Prints the physical plan to the console for debugging purposes
auction.select("auctionid").distinct.explain()

// == Physical Plan ==
// Distinct false
//  Exchange (HashPartitioning [auctionid#0], 200)
//   Distinct true
//    Project [auctionid#0]
//     PhysicalRDD [auctionid#0,bid#1,bidtime#2,bidder#3,bidderrate#4,openbid#5,price#6,item#7,daystolive#8], MapPartitionsRDD[11] at mapPartitions at ExistingRDD.scala:37
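Note the PhysicalRDD leaf at the bottom of the plan: that is the RDD (here a MapPartitionsRDD) the query ultimately scans, and the operators above it (Project, Distinct, Exchange) are executed as transformations over RDDs.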