orientdb2.2 - Performance tuning for loading gigabytes of data in OrientDB


I used the ETL tool to insert a bunch of CSV data into OrientDB. The system configuration used for trial purposes is an EC2 m3.large (7.5 GiB of memory, 2 vCPUs, 32 GB of SSD-based local instance storage, 64-bit platform).

The data (masked) I'm trying to upload is in the format below:

    "101.186.130.130","527225725","233 djfnsdkj","0.119836317542"
    "125.143.534.148","112212983","1227 sdfsdfds","0.0465215171983"
    "103.149.957.752","112364761","1121 sdfsdfds","0.0938863016658"
    "103.190.245.128","785804692","6138 sdfsdfsd","0.117767539364"

The schema contains 2 node classes and 1 edge class. When I tried loading the data using the ETL tool with the plocal option, the speed was 2300 rows/second. The ETL configuration is shown below:

    {
      "source": { "file": { "path": "/home/ubuntu/labvolume1/orientdb/bin/0001_part_00" } },
      "extractor": { "csv": {
        "columnsOnFirstLine": false,
        "columns": ["ip:string", "dpcb:string", "address:string", "prob:string"]
      } },
      "transformers": [
        { "merge": { "joinFieldName": "ip", "lookup": "ipaddress.ip" } },
        { "field": { "fieldName": "addr_key", "expression": "dpcb.append('_').append(address)" } },
        { "vertex": { "class": "ipaddress" } },
        { "edge": {
            "class": "located",
            "joinFieldName": "addr_key",
            "lookup": "phylocation.loc",
            "direction": "out",
            "targetVertexFields": { "geo_address": "${input.address}", "dpcb_number": "${input.dpcb}" },
            "edgeFields": { "confidence": "${input.prob}" },
            "unresolvedLinkAction": "CREATE"
        } }
      ],
      "loader": { "orientdb": {
        "dbURL": "plocal:/home/ubuntu/labvolume1/orientdb/databases/bulk_transfer_test1",
        "dbType": "graph",
        "dbUser": "admin",
        "dbPassword": "admin",
        "serverUser": "admin",
        "wal": false,
        "serverPassword": "admin",
        "classes": [
          { "name": "ipaddress", "extends": "V" },
          { "name": "phylocation", "extends": "V" },
          { "name": "located", "extends": "E" }
        ],
        "indexes": [
          { "class": "ipaddress", "fields": ["ip:string"], "type": "UNIQUE" },
          { "class": "phylocation", "fields": ["loc:string"], "type": "UNIQUE" }
        ]
      } }
    }

I then separated the vertices into their own files and ran the ETL job for the vertices only. This time the speed was close to 12500 rows/second, which is reasonably fast and kind of works for me. (When I removed the indexes, the speed doubled.) The config used:

    {
      "source": { "file": { "path": "/home/ubuntu/labvolume1/orientdb/bin/only_ip_05.csv" } },
      "extractor": { "csv": { "columnsOnFirstLine": false, "columns": ["ip:string"] } },
      "transformers": [
        { "vertex": { "class": "ipaddress" } }
      ],
      "loader": { "orientdb": {
        "dbURL": "plocal:/home/ubuntu/labvolume1/orientdb/databases/bulk_transfer_test7",
        "dbType": "graph",
        "dbUser": "admin",
        "dbPassword": "admin",
        "serverUser": "admin",
        "wal": false,
        "serverPassword": "admin",
        "classes": [
          { "name": "ipaddress", "extends": "V" }
        ],
        "indexes": [
          { "class": "ipaddress", "fields": ["ip:string"], "type": "UNIQUE" }
        ]
      } }
    }

However, when I tried to insert the edges alone, the speed became extremely slow at 2200 rows/second. That turned out to be lower than running the entire operation in 1 run. The config file is attached below:

    {
      "source": { "file": { "path": "/home/ubuntu/labvolume1/orientdb/bin/edge5.csv" } },
      "extractor": { "csv": {
        "columnsOnFirstLine": false,
        "columns": ["ip:string", "loc:string", "prob:string"]
      } },
      "transformers": [
        { "merge": { "joinFieldName": "ip", "lookup": "ipaddress.ip" } },
        { "vertex": { "class": "ipaddress", "skipDuplicates": true } },
        { "edge": {
            "class": "located",
            "joinFieldName": "loc",
            "lookup": "phylocation.loc",
            "direction": "out",
            "edgeFields": { "confidence": "${input.prob}" },
            "unresolvedLinkAction": "NOTHING"
        } }
      ],
      "loader": { "orientdb": {
        "dbURL": "plocal:/home/ubuntu/labvolume1/orientdb/databases/bulk_transfer_test7",
        "dbType": "graph",
        "dbUser": "admin",
        "dbPassword": "admin",
        "serverUser": "admin",
        "wal": false,
        "tx": false,
        "batchCommit": 10000,
        "serverPassword": "admin",
        "classes": [
          { "name": "ipaddress", "extends": "V" },
          { "name": "phylocation", "extends": "V" },
          { "name": "located", "extends": "E" }
        ]
      } }
    }
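For reference, the per-class input files used by the vertex-only and edge-only runs could be produced from the original four-column CSV with a short script. The following is only a sketch in Python, not the script actually used: the function names (`split_rows`, `split_file`) and the layout of the location file are assumptions made for illustration. It builds the `loc` key the same way the `addr_key` expression does (`dpcb`, an underscore, then `address`), and it keeps the de-duplication sets in memory, so a multi-gigabyte input may need to be processed in chunks or with an external sort.

```python
import csv


def split_rows(rows):
    """Split (ip, dpcb, address, prob) rows into two vertex tables and one edge table."""
    ips, locs, edges = [], [], []
    seen_ips, seen_locs = set(), set()
    for ip, dpcb, address, prob in rows:
        # Same key as the ETL expression dpcb.append('_').append(address)
        loc = f"{dpcb}_{address}"
        if ip not in seen_ips:
            seen_ips.add(ip)
            ips.append([ip])
        if loc not in seen_locs:
            seen_locs.add(loc)
            # Hypothetical layout: key first, then the fields the vertex carries
            locs.append([loc, address, dpcb])
        edges.append([ip, loc, prob])
    return ips, locs, edges


def write_csv(path, rows):
    """Write rows with every field quoted, matching the sample data format."""
    with open(path, "w", newline="") as fh:
        csv.writer(fh, quoting=csv.QUOTE_ALL).writerows(rows)


def split_file(src, ip_out, loc_out, edge_out):
    """Read the combined CSV and write one file per vertex class plus the edge file."""
    with open(src, newline="") as fh:
        ips, locs, edges = split_rows(csv.reader(fh))
    for path, rows in ((ip_out, ips), (loc_out, locs), (edge_out, edges)):
        write_csv(path, rows)
```

The edge rows come out as `ip, loc, prob`, which is the column order the edge-only extractor above expects.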

Could you please let me know if I'm doing something wrong here, and suggest better ways to improve performance?

