OrientDB 2.2 - Performance tuning for loading gigabytes of data in OrientDB


I used the ETL tool to insert a bunch of CSV data into OrientDB. The system configuration used for trial purposes is an EC2 m3.large (7.5 GiB of memory, 2 vCPUs, 32 GB of SSD-based local instance storage, 64-bit platform).

The data (masked) I'm trying to upload is of the below format:

    "101.186.130.130","527225725","233 djfnsdkj","0.119836317542"
    "125.143.534.148","112212983","1227 sdfsdfds","0.0465215171983"
    "103.149.957.752","112364761","1121 sdfsdfds","0.0938863016658"
    "103.190.245.128","785804692","6138 sdfsdfsd","0.117767539364"

The schema contains 2 node classes and 1 edge class. When I tried loading the data using the ETL tool with the plocal option, the speed was 2300 rows/second. The ETL configuration is given below:

    {
      "source": { "file": { "path": "/home/ubuntu/labvolume1/orientdb/bin/0001_part_00" } },
      "extractor": {
        "csv": {
          "columnsOnFirstLine": false,
          "columns": ["ip:string", "dpcb:string", "address:string", "prob:string"]
        }
      },
      "transformers": [
        { "merge": { "joinFieldName": "ip", "lookup": "ipaddress.ip" } },
        { "field": { "fieldName": "addr_key", "expression": "dpcb.append('_').append(address)" } },
        { "vertex": { "class": "ipaddress" } },
        { "edge": {
            "class": "located",
            "joinFieldName": "addr_key",
            "lookup": "phylocation.loc",
            "direction": "out",
            "targetVertexFields": { "geo_address": "${input.address}", "dpcb_number": "${input.dpcb}" },
            "edgeFields": { "confidence": "${input.prob}" },
            "unresolvedLinkAction": "CREATE"
          }
        }
      ],
      "loader": {
        "orientdb": {
          "dbURL": "plocal:/home/ubuntu/labvolume1/orientdb/databases/bulk_transfer_test1",
          "dbType": "graph",
          "dbUser": "admin",
          "dbPassword": "admin",
          "serverUser": "admin",
          "wal": false,
          "serverPassword": "admin",
          "classes": [
            { "name": "ipaddress", "extends": "V" },
            { "name": "phylocation", "extends": "V" },
            { "name": "located", "extends": "E" }
          ],
          "indexes": [
            { "class": "ipaddress", "fields": ["ip:string"], "type": "UNIQUE" },
            { "class": "phylocation", "fields": ["loc:string"], "type": "UNIQUE" }
          ]
        }
      }
    }

Then I separated the vertices into their own files and ran the ETL job for the vertices only. This time the speed was close to 12500 rows/second, which is reasonably fast and kind of works for me. (When I removed the indexes, the speed doubled.) The config used:

    {
      "source": { "file": { "path": "/home/ubuntu/labvolume1/orientdb/bin/only_ip_05.csv" } },
      "extractor": {
        "csv": {
          "columnsOnFirstLine": false,
          "columns": ["ip:string"]
        }
      },
      "transformers": [
        { "vertex": { "class": "ipaddress" } }
      ],
      "loader": {
        "orientdb": {
          "dbURL": "plocal:/home/ubuntu/labvolume1/orientdb/databases/bulk_transfer_test7",
          "dbType": "graph",
          "dbUser": "admin",
          "dbPassword": "admin",
          "serverUser": "admin",
          "wal": false,
          "serverPassword": "admin",
          "classes": [
            { "name": "ipaddress", "extends": "V" }
          ],
          "indexes": [
            { "class": "ipaddress", "fields": ["ip:string"], "type": "UNIQUE" }
          ]
        }
      }
    }
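For illustration, the split into vertex and edge files could be done with a small preprocessing script. The Python sketch below is an assumption about how such files might be produced, not the script actually used here. The function names and the output file names (`only_ip.csv`, `only_loc.csv`, `edges.csv`) are hypothetical; the `loc` key is built as `dpcb` + `_` + `address`, mirroring the `addr_key` expression in the first config.

```python
import csv
from pathlib import Path


def split_rows(rows):
    """Split (ip, dpcb, address, prob) rows into vertex keys and edge rows."""
    ips = {}    # dicts preserve insertion order, so they deduplicate stably
    locs = {}
    edges = []
    for ip, dpcb, address, prob in rows:
        loc = f"{dpcb}_{address}"  # mirrors dpcb.append('_').append(address)
        ips.setdefault(ip, None)
        locs.setdefault(loc, (dpcb, address))
        edges.append((ip, loc, prob))
    return list(ips), locs, edges


def write_csv(path, rows):
    """Write rows with every field quoted, like the input sample."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle, quoting=csv.QUOTE_ALL).writerows(rows)


def split_file(source, out_dir):
    """Read the combined CSV and write three files (hypothetical names)."""
    with open(source, newline="") as handle:
        ips, locs, edges = split_rows(csv.reader(handle))
    out = Path(out_dir)
    write_csv(out / "only_ip.csv", [(ip,) for ip in ips])
    write_csv(out / "only_loc.csv",
              [(loc, dpcb, addr) for loc, (dpcb, addr) in locs.items()])
    write_csv(out / "edges.csv", edges)
```

Calling `split_file("0001_part_00", ".")` would write the three files into the current directory. Deduplicating the vertex keys up front means each vertex file contains a given `ip` or `loc` only once, so the unique indexes should not see duplicate keys during the vertex load.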

However, when I tried to insert the edges alone, the speed became extremely slow at 2200 rows/second. That turned out to be lower than running the entire operation in 1 run. The config file is attached below:

    {
      "source": { "file": { "path": "/home/ubuntu/labvolume1/orientdb/bin/edge5.csv" } },
      "extractor": {
        "csv": {
          "columnsOnFirstLine": false,
          "columns": ["ip:string", "loc:string", "prob:string"]
        }
      },
      "transformers": [
        { "merge": { "joinFieldName": "ip", "lookup": "ipaddress.ip" } },
        { "vertex": { "class": "ipaddress", "skipDuplicates": true } },
        { "edge": {
            "class": "located",
            "joinFieldName": "loc",
            "lookup": "phylocation.loc",
            "direction": "out",
            "edgeFields": { "confidence": "${input.prob}" },
            "unresolvedLinkAction": "NOTHING"
          }
        }
      ],
      "loader": {
        "orientdb": {
          "dbURL": "plocal:/home/ubuntu/labvolume1/orientdb/databases/bulk_transfer_test7",
          "dbType": "graph",
          "dbUser": "admin",
          "dbPassword": "admin",
          "serverUser": "admin",
          "wal": false,
          "tx": false,
          "batchCommit": 10000,
          "serverPassword": "admin",
          "classes": [
            { "name": "ipaddress", "extends": "V" },
            { "name": "phylocation", "extends": "V" },
            { "name": "located", "extends": "E" }
          ]
        }
      }
    }
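To put the three measured rates in perspective, here is a rough Python sketch that converts rows/second into load time. The row count is a made-up assumption for illustration only (the real dataset size is not stated beyond "gigabytes"), and `load_hours` is a hypothetical helper name.

```python
def load_hours(rows, rows_per_second):
    """Hours needed to load `rows` rows at a constant rate."""
    return rows / rows_per_second / 3600


# Rates measured above, in rows per second.
RATES = {
    "vertices and edges in one run": 2300,
    "vertices only": 12500,
    "edges only": 2200,
}

# Assumption: 100 million rows, chosen only to make the comparison concrete.
HYPOTHETICAL_ROWS = 100_000_000

for name, rate in RATES.items():
    print(f"{name}: {load_hours(HYPOTHETICAL_ROWS, rate):.1f} h")
```

Since every input row produces an edge, the edge rate dominates the total load time regardless of how fast the vertex-only pass runs.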

Please let me know if I'm doing something wrong here, and please suggest better ways to improve performance.

