OrientDB 2.2 - Performance tuning for loading gigabytes of data in OrientDB
I used the ETL tool to insert a bunch of CSV data into OrientDB. The system configuration used for the trial is an EC2 m3.large (7.5 GiB of memory, 2 vCPUs, 32 GB of SSD-based local instance storage, 64-bit platform).
The data (masked) that I'm trying to upload is in the format below:
```
"101.186.130.130","527225725","233 djfnsdkj","0.119836317542"
"125.143.534.148","112212983","1227 sdfsdfds","0.0465215171983"
"103.149.957.752","112364761","1121 sdfsdfds","0.0938863016658"
"103.190.245.128","785804692","6138 sdfsdfsd","0.117767539364"
```
The schema contains two vertex classes and one edge class. When I tried loading the data using the ETL tool with the plocal option, the speed was about 2,300 rows/second. The ETL configuration is below:
```json
{
  "source": { "file": { "path": "/home/ubuntu/labvolume1/orientdb/bin/0001_part_00" } },
  "extractor": {
    "csv": {
      "columnsOnFirstLine": false,
      "columns": ["ip:string", "dpcb:string", "address:string", "prob:string"]
    }
  },
  "transformers": [
    { "merge": { "joinFieldName": "ip", "lookup": "ipaddress.ip" } },
    { "field": { "fieldName": "addr_key", "expression": "dpcb.append('_').append(address)" } },
    { "vertex": { "class": "ipaddress" } },
    { "edge": {
        "class": "located",
        "joinFieldName": "addr_key",
        "lookup": "phylocation.loc",
        "direction": "out",
        "targetVertexFields": {
          "geo_address": "${input.address}",
          "dpcb_number": "${input.dpcb}"
        },
        "edgeFields": { "confidence": "${input.prob}" },
        "unresolvedLinkAction": "CREATE"
    } }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "plocal:/home/ubuntu/labvolume1/orientdb/databases/bulk_transfer_test1",
      "dbType": "graph",
      "dbUser": "admin",
      "dbPassword": "admin",
      "serverUser": "admin",
      "serverPassword": "admin",
      "wal": false,
      "classes": [
        { "name": "ipaddress", "extends": "V" },
        { "name": "phylocation", "extends": "V" },
        { "name": "located", "extends": "E" }
      ],
      "indexes": [
        { "class": "ipaddress", "fields": ["ip:string"], "type": "UNIQUE" },
        { "class": "phylocation", "fields": ["loc:string"], "type": "UNIQUE" }
      ]
    }
  }
}
```
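For reference, two knobs I have seen mentioned for this kind of combined run are the ETL's top-level `config` block (which can enable parallel extraction/transformation) and the loader's `batchCommit`. This is only a sketch based on the OrientDB 2.2 ETL documentation, not something I have benchmarked on this data set; only the changed fragments are shown:

```json
{
  "config": {
    "log": "info",
    "parallel": true
  },
  "loader": {
    "orientdb": {
      "wal": false,
      "tx": false,
      "batchCommit": 10000
    }
  }
}
```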
I then separated the vertices into their own files and ran the ETL job on the vertices alone; the speed was close to 12,500 rows/second, which is reasonably fast and kind of works for me. (When I removed the indexes the speed doubled.) The config used:
```json
{
  "source": { "file": { "path": "/home/ubuntu/labvolume1/orientdb/bin/only_ip_05.csv" } },
  "extractor": {
    "csv": {
      "columnsOnFirstLine": false,
      "columns": ["ip:string"]
    }
  },
  "transformers": [
    { "vertex": { "class": "ipaddress" } }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "plocal:/home/ubuntu/labvolume1/orientdb/databases/bulk_transfer_test7",
      "dbType": "graph",
      "dbUser": "admin",
      "dbPassword": "admin",
      "serverUser": "admin",
      "serverPassword": "admin",
      "wal": false,
      "classes": [
        { "name": "ipaddress", "extends": "V" }
      ],
      "indexes": [
        { "class": "ipaddress", "fields": ["ip:string"], "type": "UNIQUE" }
      ]
    }
  }
}
```
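Since removing the indexes doubled the vertex throughput, one pattern worth trying is to drop the `indexes` section from this config entirely and build the unique index once, after the load, from the OrientDB console. This is a hedged suggestion (index maintenance cost per insert vs. one bulk build), not something I have measured here:

```sql
-- Run in the OrientDB console after the vertex load finishes.
-- One bulk index build is usually cheaper than maintaining the
-- index on every single insert.
CREATE INDEX ipaddress.ip ON ipaddress (ip) UNIQUE;
```

Note that any later run whose `merge` transformer does a `lookup` on `ipaddress.ip` needs this index to exist, so it must be created before the edge-loading pass.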
However, when I tried to insert the edges alone, the speed became extremely slow at about 2,200 rows/second. It turned out to be even lower than running the entire operation in one run. The config file is attached below:
```json
{
  "source": { "file": { "path": "/home/ubuntu/labvolume1/orientdb/bin/edge5.csv" } },
  "extractor": {
    "csv": {
      "columnsOnFirstLine": false,
      "columns": ["ip:string", "loc:string", "prob:string"]
    }
  },
  "transformers": [
    { "merge": { "joinFieldName": "ip", "lookup": "ipaddress.ip" } },
    { "vertex": { "class": "ipaddress", "skipDuplicates": true } },
    { "edge": {
        "class": "located",
        "joinFieldName": "loc",
        "lookup": "phylocation.loc",
        "direction": "out",
        "edgeFields": { "confidence": "${input.prob}" },
        "unresolvedLinkAction": "NOTHING"
    } }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "plocal:/home/ubuntu/labvolume1/orientdb/databases/bulk_transfer_test7",
      "dbType": "graph",
      "dbUser": "admin",
      "dbPassword": "admin",
      "serverUser": "admin",
      "serverPassword": "admin",
      "wal": false,
      "tx": false,
      "batchCommit": 10000,
      "classes": [
        { "name": "ipaddress", "extends": "V" },
        { "name": "phylocation", "extends": "V" },
        { "name": "located", "extends": "E" }
      ]
    }
  }
}
```
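One thing I noticed while writing this up: the edge-only config has no `indexes` section, and the earlier vertex-only run against `bulk_transfer_test7` only created an index on `ipaddress.ip`. If `phylocation.loc` has no index in that database, the edge transformer's `lookup: "phylocation.loc"` may be doing a scan per row, which could explain the slowdown. A possible fix (an untested sketch; the hash index type is my assumption for point lookups) would be to declare both lookup fields as indexed in the loader:

```json
"indexes": [
  { "class": "ipaddress",   "fields": ["ip:string"],  "type": "UNIQUE_HASH_INDEX" },
  { "class": "phylocation", "fields": ["loc:string"], "type": "UNIQUE_HASH_INDEX" }
]
```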
Please can you let me know if I'm doing anything wrong here, and please suggest better ways to improve the performance.