scala - How does the DataFrame API depend on RDDs in Spark?


Some sources, such as this keynote (the Spark 2.0 talk by Matei Zaharia), mention that Spark DataFrames are built on top of RDDs. I have found some mentions of RDDs in the DataFrame class (in Spark 2.0 I'd have to look at Dataset); still, I have a limited understanding of how these two APIs are bound behind the scenes.

Can someone explain how DataFrames extend RDDs, if they do?

According to the Databricks article Deep Dive into Spark SQL's Catalyst Optimizer (see the section Using Catalyst in Spark SQL), RDDs are elements of the physical plan built by Catalyst. So, even though you describe your queries in terms of DataFrames, in the end Spark operates on RDDs.
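You can also see the binding directly in the API: every DataFrame exposes the RDD it compiles down to, both as a public RDD[Row] and, at a lower level, as the RDD[InternalRow] produced by the compiled physical plan. A minimal sketch, assuming a local Spark 2.x session (the object name and toy data are mine):

import org.apache.spark.sql.SparkSession

object DataFrameRddBridge {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataframe-rdd-bridge")
      .master("local[*]")   // assumption: local mode, for illustration only
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b"), (2, "b")).toDF("id", "value")

    // Public bridge: Dataset.rdd converts the DataFrame back into an
    // RDD[Row]; the result is a plain RDD (e.g. a MapPartitionsRDD).
    val rowRdd = df.rdd
    println(rowRdd.toDebugString)

    // Internal bridge: queryExecution.toRdd is the RDD[InternalRow]
    // produced by the physical plan that Catalyst compiled for the query.
    val internalRdd = df.queryExecution.toRdd
    println(internalRdd.toDebugString)

    spark.stop()
  }
}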

[Figure: Catalyst workflow]

Also, you can view the physical plan of a query using the explain method.

// Prints the physical plan to the console, for debugging purposes
auction.select("auctionid").distinct.explain()

// == Physical Plan ==
// Distinct false
//  Exchange (HashPartitioning [auctionid#0], 200)
//   Distinct true
//    Project [auctionid#0]
//     PhysicalRDD [auctionid#0,bid#1,bidtime#2,bidder#3,bidderrate#4,openbid#5,price#6,item#7,daystolive#8], MapPartitionsRDD[11] at mapPartitions at ExistingRDD.scala:37
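The auction DataFrame isn't defined in the post, so here is a minimal, self-contained sketch that reproduces the call with made-up data. Note that on Spark 2.x the distinct compiles to HashAggregate/Exchange operators rather than the 1.x Distinct/Exchange nodes shown above, but the leaf of the plan is still a scan over the source RDD:

import org.apache.spark.sql.SparkSession

object ExplainDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("explain-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy stand-in for the auction data; the real schema has more columns
    // (bid, bidtime, bidder, ...), but only auctionid matters for the query.
    // Built from an RDD so the plan bottoms out in a scan over an existing
    // RDD, mirroring the PhysicalRDD node in the 1.x output above (the
    // exact operator names vary across Spark versions).
    val auction = spark.sparkContext
      .parallelize(Seq(("a1", 10.0), ("a1", 12.5), ("a2", 7.0)))
      .toDF("auctionid", "bid")

    // Prints the physical plan to the console
    auction.select("auctionid").distinct.explain()

    spark.stop()
  }
}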
