Java: Huge CSV file processing and storing to Cassandra using Apache Spark / Kafka / Storm


I am working on a requirement where I need to read sensor data from a CSV/TSV file and insert it into a Cassandra DB.

CSV format:

sensor1 timestamp1 value
sensor1 timestamp2 value
sensor2 timestamp1 value
sensor2 timestamp3 value

Details:

A user can upload a file to our web application. Once the file is uploaded, I need to display the unique values of the first column to the user on the next page. For example ->

  1. sensor1 -> node1
  2. sensor2 -> node2
  3. sensorN -> create

The user can either map sensor1 to an existing primary key called node1, in which case the timestamps and values of sensor1 are added to the table under the primary key node1, or create a new primary key, in which case the timestamps and values are added under the new primary key.
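For illustration, here is a minimal sketch of what such a table could look like, created through the DataStax Java driver. The keyspace name, table name, and column types are assumptions, not from the original post:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class SchemaSetup {
        public static void main(String[] args) {
            // Hypothetical keyspace/table: one partition per node,
            // rows clustered by timestamp so a node's series stays together.
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect()) {
                session.execute("CREATE KEYSPACE IF NOT EXISTS sensors WITH replication = "
                        + "{'class': 'SimpleStrategy', 'replication_factor': 1}");
                session.execute("CREATE TABLE IF NOT EXISTS sensors.readings ("
                        + "node_id text, ts timestamp, value double, "
                        + "PRIMARY KEY (node_id, ts))");
            }
        }
    }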

I was able to implement this using Java 8 streams and collections, and it works for a small CSV file.
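For reference, a minimal sketch of that Java 8 streams approach (the file name and the whitespace column separator are assumptions). It streams lines lazily, but the set of distinct sensor ids still has to fit in memory on one machine, which is why it only works for small files:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Set;
    import java.util.TreeSet;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    public class UniqueSensors {
        public static void main(String[] args) throws IOException {
            // Stream the file line by line; only the distinct sensor ids
            // are held in memory, not the whole file.
            try (Stream<String> lines = Files.lines(Paths.get("sensors.tsv"))) {
                Set<String> sensors = lines
                        .map(line -> line.split("\\s+")[0]) // first column: sensor id
                        .collect(Collectors.toCollection(TreeSet::new));
                sensors.forEach(System.out::println);
            }
        }
    }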

Questions:

  1. How can I upload a huge CSV/TSV file (200 GB) to the web application? Shall I upload the file to HDFS and specify the path in the UI? Do I have to split the huge file into small chunks (50 MB each)?

  2. How can I get the unique values of the first column? Can I use Kafka/Spark here? I also need to insert the timestamps/values into the Cassandra DB. Again, can I use Kafka/Spark here?

Any help is highly appreciated.

How can I upload a huge CSV/TSV file (200 GB) to the web application? Shall I upload the file to HDFS and specify the path in the UI? Do I have to split the huge file into small chunks (50 MB each)?

It depends on how your web app is going to be used. Uploading a file of such a huge size within the context of an HTTP request from client to server is going to be tricky; you will have to do it asynchronously. Whether you put it in HDFS, S3, or a simple SFTP server is a matter of design choice, and that choice will affect the kinds of tools you can build around the file. I would suggest starting with a simple FTP/NAS and, once you need to scale, moving to S3. (Using HDFS as shared file storage is something I haven't seen many people do, but that shouldn't prohibit you from trying it.)
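As a rough sketch of the asynchronous S3 route (the bucket, key, and file path are made up, and it assumes the AWS SDK for Java v1), TransferManager handles the multipart splitting for you:

    import java.io.File;
    import com.amazonaws.services.s3.transfer.TransferManager;
    import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
    import com.amazonaws.services.s3.transfer.Upload;

    public class AsyncS3Upload {
        public static void main(String[] args) throws InterruptedException {
            TransferManager tm = TransferManagerBuilder.standard().build();
            // Large files are split into multipart uploads and the parts
            // are transferred in parallel on background threads.
            Upload upload = tm.upload("sensor-uploads", "incoming/sensors.tsv",
                    new File("/data/sensors.tsv"));
            upload.waitForCompletion(); // a web app would poll upload.getProgress() instead
            tm.shutdownNow();
        }
    }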

How can I get the unique values of the first column? Can I use Kafka/Spark here? I also need to insert the timestamps/values into the Cassandra DB. Again, can I use Kafka/Spark here?

A Spark batch job or a normal M/R job will do the trick for you. It is a simple groupBy operation, though you should look at how far you are willing to sacrifice on latency, as groupBy operations can be costly (they involve shuffles). Generally, in my limited experience, using streaming for such use cases is overkill unless you have a continuous stream of source data. The way you have described your use case, it looks more like a batch candidate to me.
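A hedged sketch of what that batch job could look like with Spark SQL and the spark-cassandra-connector (the input path, keyspace/table names, and tab separator are assumptions, and the user's sensor-to-node mapping is reduced to a simple column rename here):

    import static org.apache.spark.sql.functions.col;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class SensorBatchJob {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("sensor-tsv-to-cassandra")
                    .config("spark.cassandra.connection.host", "127.0.0.1")
                    .getOrCreate();

            Dataset<Row> readings = spark.read()
                    .option("sep", "\t")
                    .csv("hdfs:///uploads/sensors.tsv")
                    .toDF("sensor_id", "ts", "value")
                    .withColumn("ts", col("ts").cast("timestamp"))
                    .withColumn("value", col("value").cast("double"));

            // The distinct values of the first column (this shuffles, like a groupBy).
            readings.select("sensor_id").distinct().show();

            // Placeholder for the user's sensor -> node mapping; here it is just a rename.
            Dataset<Row> mapped = readings.withColumnRenamed("sensor_id", "node_id");

            mapped.write()
                    .format("org.apache.spark.sql.cassandra")
                    .option("keyspace", "sensors")
                    .option("table", "readings")
                    .mode("append")
                    .save();

            spark.stop();
        }
    }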

Some things to focus on: how you will transfer the file from the client to the app, the end-to-end SLAs for availability of the data in Cassandra, what happens when there are failures (do you retry, etc.), and how the jobs will run (will they be triggered every time a user uploads a file, or can it be a cron job), etc.

