Optimal database design in terms of query speed to store matrices from R
I have hundreds of matrices that need to be used in R, most of them around 45000x350 each. What I'd like to find is an optimal choice of database software and schema to store the data in and be able to call subsets of the matrices from the database. This needs to be done with the objective of extracting the data as fast as possible.

As a base, here is some code that creates 5 matrices similar to the ones I'm dealing with:
if(!"zoo" %in% installed.packages()[,1]) { install.packages("zoo") } require("zoo", quietly=true) numsymbols <- 45000 numvariables <- 5 rdatepattern <- "%d/%m/%y" startdate <- "31/12/1982" enddate <- "30/09/2011" startyearmonth <- as.yearmon(startdate,format=rdatepattern) alphanumeric <- c(1:9,toupper(letters)) nummonths <- (as.yearmon(enddate,format=rdatepattern)-startyearmonth)*12 numvalues <- numsymbols*(nummonths+1) datevector <- sapply(1:(nummonths+1), function(x) {as.character(format(as.date(startyearmonth+x*1/12,fraq=0)-1,rdatepattern))}) symbolnames <- sapply(1:numsymbols,function(x) {as.character(paste((sample(alphanumeric,7)),collapse=""))}) for(i in 1:numvariables) { assign(paste("variable",i,sep="_"),matrix(sample(c(rnorm(numvalues/2),rep(na,numvalues/2))), nrow=numsymbols, ncol=(nummonths+1), dimnames=list(symbolnames,datevector))) }
Basically the matrices have about half of their values filled with doubles and the rest are NAs.
# > ls()[grepl("variable_", ls())]
# [1] "variable_1" "variable_2" "variable_3" "variable_4" "variable_5"
# > dim(variable_1)
# [1] 45000   346
# > variable_1[1:10, 1:3]
#                  31/12/1982          31/01/1983           28/02/1983
# AF3HM5V                  NA                  NA -1.15076100366945755
# DL8TVIY                  NA                  NA -1.59412257037490046
# JEFDYPO                  NA                  NA                   NA
# 21ZV689                  NA                  NA -0.31095014405320764
# RP1DZHB -1.0571670785223215 -0.7206356272944392 -0.84028668343265112
# N6DUSZC                  NA                  NA -1.31113363079930023
# XG3ZA1W                  NA  0.8531074740045220  0.06797987526470438
# W1JCXIE  0.2782029710832690 -1.2668560986048898                   NA
# X3RKT2B  1.5220172324681460 -1.0460218516729356                   NA
# 3EUB8VN -0.9405417187846803  1.1151437940206490  1.60520458945005262
I want to be able to store these in a database. An RDBMS would be the default option, but I'm willing to look at other options. The biggest part of the optimal solution is quick querying, either of the whole matrix or of a subset of the matrix, e.g. 2000 symbols and 100 dates, etc.

The current solution I've been using is saving each matrix as an RData file, loading in the whole matrix and truncating it to the size needed for use. This is very quick, but I feel a database design would be more beneficial in terms of scaling the matrices (more symbols and dates) and for backing up the data.

What I've tried so far in terms of RDBMS options is:
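For concreteness, the current workflow amounts to something like this (a minimal sketch; the file name and the subset vectors are hypothetical):

save(variable_1, file = "variable_1.RData")        # one file per matrix

load("variable_1.RData")                            # always deserializes the full 45000 x 346 matrix
somesymbols <- rownames(variable_1)[1:2000]         # hypothetical subset
somedates   <- colnames(variable_1)[1:100]
variable_1  <- variable_1[somesymbols, somedates]   # then truncate to the piece actually needed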
a)
- Fields: symbol, variable, date, value
- Separate and clustered indices on everything but the value.
- Data needs to be "melted"/pivoted to an MxN matrix for R (crazy memory inefficient; see the sketch after this list)
- Average query for a normal sample from R: 4-8 minutes
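The pivot step for schema (a) looks roughly like this - a sketch assuming the query result arrives as a long-format data frame res with columns symbol, date and value (names hypothetical), using reshape2:

library(reshape2)
# every cell of the 45000 x 346 matrix is one row of `res`, so the symbol and date strings
# are repeated once per value - that repetition is where the memory inefficiency comes from
variable_1_sub <- acast(res, symbol ~ date, value.var = "value")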
b)
- Each variable in a separate table.
- Fields: symbol, date, value
- Separate and clustered indices on everything but the value.
- Views added to cache common subsets (don't know if they helped at all...)
- Data needs to be "melted"/pivoted to an MxN matrix for R (crazy memory inefficient)
- Average query for a normal sample from R: 3-5 minutes
c) [Should maybe have tried a column-based database here]

- Symbols and dates stored separately and mapped to row and column numbers
- Each variable in a separate table with symbols as rows and dates as columns
- Maps badly to disk when scaling the rows and columns.
- Data is already in the correct format for R
- Average query for a normal sample from R: 1-3 minutes
In comparison to the above database setups, loading a whole variable from the RData files takes 5 sec locally and 20 sec over the network. All the database times above are over the network.

Is there anything I can do to make the database route come anywhere close to the binary file speeds?

Maybe one of the tabular NoSQL databases is what I need?

How does that scale in terms of additional symbols + dates?

Any help from anyone who's dealt with something similar would be appreciated.
UPDATE: Thought I'd post an update to this. In the end I went with Iterator's suggestion and the data is now hosted in bigmemory memory-mapped files, with RData for quick drag-and-drop use, and the checking output written to CSV and pulled into SQL Server for backup purposes. Any database solution is too slow to be used by multiple users. Also, using RODBC against SQL Server is crazy slow; I also tried input and output to R via CSV to and from SQL, which was okay but pointless.

Also for reference, byte compiling the load method for bigmemory does seem to have an impact. Here are the results of a load test of RData vs. bigmemory:
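For reference, converting one of the matrices into a file-backed big.matrix (the .desc file is what gets attached in the benchmark code below) looks roughly like this - a sketch with hypothetical file names:

require("bigmemory")
# writes nvar_1.bin (the memory-mapped backing file) and nvar_1.desc (its descriptor)
bm <- as.big.matrix(variable_1,
                    type = "double",
                    backingfile = "nvar_1.bin",
                    backingpath = workingdirectory,
                    descriptorfile = "nvar_1.desc")
# dimnames should carry over, which is what allows subscripting by symbol and date later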
workingdirectory <- "/users/hans/92 speed test/"
require("bigmemory")
require("compiler")
require("rbenchmark")

# load the full matrix from each .RData file, then subset it
loadvariablesinfolder <- function(folder, sedols, dates) {
  filesinfolder <- dir(folder)
  filestoload <- filesinfolder[grepl(".*nvar_.*\\.rdata", filesinfolder)]
  filestoload <- paste(folder, filestoload, sep = "/")
  variablesthatwereloaded <- c()
  for (ftoload in filestoload) {
    loadedvar <- load(ftoload)
    assign(loadedvar, get(loadedvar)[sedols, dates])
    gc() -> ans
    variablesthatwereloaded <- c(variablesthatwereloaded, loadedvar)
    rm(list = c(loadedvar))
  }
  return(variablesthatwereloaded)
}
cloadvariablesinfolder <- cmpfun(loadvariablesinfolder)

# attach each file-backed matrix via its .desc file and subset without loading it all
bigmloadvariablesinfolder <- function(folder, sedols, dates) {
  workd <- getwd()
  setwd(folder)
  filesinfolder <- dir(folder)
  filestoload <- filesinfolder[grepl(".*nvar_.*\\.desc", filesinfolder)]
  variablesthatwereloaded <- c()
  for (ftoload in filestoload) {
    tempvar <- attach.big.matrix(dget(ftoload))
    loadedvar <- gsub(".*(nvar_\\d+).*", "\\1", ftoload, perl = TRUE)
    assign(loadedvar, tempvar[sedols, dates])
    variablesthatwereloaded <- c(variablesthatwereloaded, loadedvar)
    rm(list = c(loadedvar, "tempvar"))
    gc() -> ans
  }
  setwd(workd)
  return(variablesthatwereloaded)
}
cbigmloadvariablesinfolder <- cmpfun(bigmloadvariablesinfolder)

testcases <- list(
  list(numsedols = 1000,  numdates = 120),
  list(numsedols = 5000,  numdates = 120),
  list(numsedols = 50000, numdates = 120),
  list(numsedols = 1000,  numdates = 350),
  list(numsedols = 5000,  numdates = 350),
  list(numsedols = 50000, numdates = 350))

load(paste(workingdirectory, "dates.cache", sep = "/"))
load(paste(workingdirectory, "sedols.cache", sep = "/"))

for (testcase in testcases) {
  results <- benchmark(
    loadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates]),
    cloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates]),
    bigmloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates]),
    cbigmloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates]),
    columns = c("test", "replications", "elapsed", "relative"),
    order = "relative", replications = 3)
  cat("results for testcase:\n")
  print(testcase)
  print(results)
}
Basically the smaller the subset, the more you gain, because you don't spend time loading in the whole matrix. Loading the whole matrix is actually slower with bigmemory than with RData; I guess it's the conversion overhead:
# results for testcase:
# $numsedols
# [1] 1000
#
# $numdates
# [1] 120
#
#   test
# 4 cbigmloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
# 3 bigmloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
# 1 loadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
# 2 cloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
#   replications             elapsed           relative
# 4            3   6.799999999999955  1.000000000000000
# 3            3  14.389999999999986  2.116176470588247
# 1            3 235.639999999999986 34.652941176470819
# 2            3 250.590000000000032 36.851470588235543
#
# results for testcase:
# $numsedols
# [1] 5000
#
# $numdates
# [1] 120
#
#   test
# 4 cbigmloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
# 3 bigmloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
# 1 loadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
# 2 cloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
#   replications             elapsed           relative
# 4            3   7.080000000000155  1.000000000000000
# 3            3  32.730000000000018  4.622881355932105
# 1            3 249.389999999999873 35.224576271185654
# 2            3 254.909999999999854 36.004237288134789
#
# results for testcase:
# $numsedols
# [1] 50000
#
# $numdates
# [1] 120
#
#   test
# 3 bigmloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
# 4 cbigmloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
# 2 cloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
# 1 loadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
#   replications            elapsed          relative
# 3            3 146.3499999999999 1.000000000000000
# 4            3 148.1799999999998 1.012504270584215
# 2            3 238.3200000000002 1.628425008541171
# 1            3 240.4600000000000 1.643047488896482
#
# results for testcase:
# $numsedols
# [1] 1000
#
# $numdates
# [1] 350
#
#   test
# 3 bigmloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
# 4 cbigmloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
# 1 loadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
# 2 cloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
#   replications            elapsed          relative
# 3            3  83.88000000000011 1.000000000000000
# 4            3  91.71000000000004 1.093347639484977
# 1            3 235.69000000000005 2.809847401049115
# 2            3 240.79999999999973 2.870767763471619
#
# results for testcase:
# $numsedols
# [1] 5000
#
# $numdates
# [1] 350
#
#   test
# 3 bigmloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
# 4 cbigmloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
# 2 cloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
# 1 loadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
#   replications            elapsed          relative
# 3            3 135.6999999999998 1.000000000000000
# 4            3 155.8900000000003 1.148784082535008
# 2            3 233.3699999999999 1.719749447310245
# 1            3 240.5599999999995 1.772733971997051
#
# results for testcase:
# $numsedols
# [1] 50000
#
# $numdates
# [1] 350
#
#   test
# 2 cloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
# 1 loadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
# 3 bigmloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
# 4 cbigmloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
#   replications            elapsed          relative
# 2            3 236.5000000000000 1.000000000000000
# 1            3 237.2100000000000 1.003002114164905
# 3            3 388.2900000000000 1.641818181818182
# 4            3 393.6300000000001 1.664397463002115
I recommend using HDF5. I assume that your data is complex enough that a variety of bigmemory files (i.e. memory-mapped matrices) would not easily satisfy your needs (see note 1), but HDF5 is just short of the speed of memory-mapped files. See this longer answer to another question to understand how I compare HDF5 and .RDat files.

Most notably, the fact that HDF5 supports random access means that you should be able to get substantial speed improvements.
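To illustrate what that random access buys you, here is a rough sketch using the rhdf5 package (Bioconductor); the file and dataset names are hypothetical, and note that dimnames would have to be stored as separate datasets:

library(rhdf5)

h5createFile("variables.h5")
h5write(variable_1, file = "variables.h5", name = "variable_1")   # store the full matrix once

# read only 2000 rows and 100 columns straight from disk, without touching the rest
sub <- h5read("variables.h5", "variable_1", index = list(1:2000, 1:100))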
Another option, depending on your willingness to design your own binary format, is to use readBin and writeBin, though this doesn't have all of the nice features that HDF5 has, such as parallel I/O, version information, portability, etc.
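A rough sketch of what such a hand-rolled format could look like, assuming hypothetical file names and that you only need contiguous column ranges: the matrix is written column-major as 8-byte doubles, so seek() can jump straight to a column.

con <- file("variable_1.bin", "wb")
writeBin(as.vector(variable_1), con, size = 8)     # column-major doubles, NAs included
close(con)

read_columns <- function(path, nrows, cols) {
  con <- file(path, "rb")
  on.exit(close(con))
  out <- matrix(NA_real_, nrow = nrows, ncol = length(cols))
  for (j in seq_along(cols)) {
    seek(con, where = (cols[j] - 1) * nrows * 8)   # byte offset of column cols[j]
    out[, j] <- readBin(con, what = "double", n = nrows, size = 8)
  }
  out
}

sub <- read_columns("variable_1.bin", nrows = 45000, cols = 1:100)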
Note 1: If you have just a few data types per row, i.e. one character and the rest numeric, you can simply create two memory-mapped matrices, one of the characters and the other of the numeric values. This will allow you to use bigmemory, mwhich, bigtabulate and lots of other nice functions in the bigmemory suite. I'd give that a reasonable effort, as it's a very easy system to smoothly integrate with lots of R code: the matrix need never enter memory, just whatever subsets you happen to need, and many instances can access the same files simultaneously. What's more, it is easy to parallelize access using multicore backends for foreach(). I used to have an operation that would take about 3 minutes per .Rdat file: 2 minutes to load, 20 seconds to subselect what I needed, 10 seconds to analyze, and 30 seconds to save the results. After switching to bigmemory, I got it down to about 10 seconds to analyze and 5-15 seconds on the I/O.
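As a sketch of the parallelization point, assuming a doParallel backend and a hypothetical descriptor file nvar_1.desc: each worker attaches the same backing file and only its own subset ever enters memory.

require("bigmemory")
require("foreach")
require("doParallel")

registerDoParallel(cores = 4)
desc <- dget("nvar_1.desc")

# per-column means computed in parallel against one shared memory-mapped file
colmeans <- foreach(j = 1:346, .combine = c, .packages = "bigmemory") %dopar% {
  x <- attach.big.matrix(desc)
  mean(x[, j], na.rm = TRUE)
}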
Update 1: I overlooked the ff package - it is another option, though a lot more complex than bigmemory.