Optimal database design in terms of query speed to store matrices from R
I have hundreds of matrices that need to be used in R, most of them around 45000x350 each. What I'd like to find is an optimal choice of database software and schema to store the data in and be able to call subsets of the matrices from the database. This needs to be done with the objective of extracting the data as fast as possible.

As a base, here is some code that creates 5 matrices similar to the ones I'm dealing with:
if(!"zoo" %in% installed.packages()[,1]) { install.packages("zoo") } require("zoo", quietly=true) numsymbols <- 45000 numvariables <- 5 rdatepattern <- "%d/%m/%y" startdate <- "31/12/1982" enddate <- "30/09/2011" startyearmonth <- as.yearmon(startdate,format=rdatepattern) alphanumeric <- c(1:9,toupper(letters)) nummonths <- (as.yearmon(enddate,format=rdatepattern)-startyearmonth)*12 numvalues <- numsymbols*(nummonths+1) datevector <- sapply(1:(nummonths+1), function(x) {as.character(format(as.date(startyearmonth+x*1/12,fraq=0)-1,rdatepattern))}) symbolnames <- sapply(1:numsymbols,function(x) {as.character(paste((sample(alphanumeric,7)),collapse=""))}) for(i in 1:numvariables) { assign(paste("variable",i,sep="_"),matrix(sample(c(rnorm(numvalues/2),rep(na,numvalues/2))), nrow=numsymbols, ncol=(nummonths+1), dimnames=list(symbolnames,datevector))) }
Basically the matrices have about half of their values filled with doubles and the rest are NAs.
# > ls()[grepl("variable_", ls())]
# [1] "variable_1" "variable_2" "variable_3" "variable_4" "variable_5"
# > dim(variable_1)
# [1] 45000   346
# > variable_1[1:10, 1:3]
#                  31/12/1982          31/01/1983           28/02/1983
# AF3HM5V                  NA                  NA -1.15076100366945755
# DL8TVIY                  NA                  NA -1.59412257037490046
# JEFDYPO                  NA                  NA                   NA
# 21ZV689                  NA                  NA -0.31095014405320764
# RP1DZHB -1.0571670785223215 -0.7206356272944392 -0.84028668343265112
# N6DUSZC                  NA                  NA -1.31113363079930023
# XG3ZA1W                  NA  0.8531074740045220  0.06797987526470438
# W1JCXIE  0.2782029710832690 -1.2668560986048898                   NA
# X3RKT2B  1.5220172324681460 -1.0460218516729356                   NA
# 3EUB8VN -0.9405417187846803  1.1151437940206490  1.60520458945005262
I want to be able to store these in a database. An RDBMS would be the default option, but I'm willing to look at other options. The biggest part of the optimal solution is quick querying, either of the whole matrix or of a subset of the matrix, e.g. 2000 symbols and 100 dates, etc.

The current solution I've been using is saving each matrix as an RData file, loading in the whole matrix and truncating it to the size needed for use. This is very quick, but I feel a database design would be more beneficial in terms of scaling the matrices (more symbols and dates) and for backing up the data.

What I've tried so far in terms of RDBMS options is:
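For concreteness, the current workflow amounts to something like this (a minimal sketch; the file name and the subset vectors are hypothetical):

save(variable_1, file = "variable_1.RData")        # one file per matrix

load("variable_1.RData")                            # always deserializes the full 45000 x 346 matrix
somesymbols <- rownames(variable_1)[1:2000]         # hypothetical subset
somedates   <- colnames(variable_1)[1:100]
variable_1  <- variable_1[somesymbols, somedates]   # then truncate to the piece actually needed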
a)
- Fields: symbol, variable, date, value
- Separate and clustered indices on everything but the value.
- Data needs to be "melted"/pivoted to an MxN matrix for R (crazy memory inefficient; see the sketch after this list)
- Average query for a normal sample from R: 4-8 minutes
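The pivot step for schema (a) looks roughly like this - a sketch assuming the query result arrives as a long-format data frame res with columns symbol, date and value (names hypothetical), using reshape2:

library(reshape2)
# every cell of the 45000 x 346 matrix is one row of `res`, so the symbol and date strings
# are repeated once per value - that repetition is where the memory inefficiency comes from
variable_1_sub <- acast(res, symbol ~ date, value.var = "value")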
b)
- Each variable in a separate table.
- Fields: symbol, date, value
- Separate and clustered indices on everything but the value.
- Views added to cache common subsets (don't know if they helped at all...)
- Data needs to be "melted"/pivoted to an MxN matrix for R (crazy memory inefficient)
- Average query for a normal sample from R: 3-5 minutes
c) [Should maybe have tried a column-based database here]

- Symbols and dates stored separately and mapped to row and column numbers
- Each variable in a separate table with symbols as rows and dates as columns
- Maps badly to disk when scaling the rows and columns.
- Data is already in the correct format for R
- Average query for a normal sample from R: 1-3 minutes
In comparison to the above database setups, loading a whole variable from the RData files takes 5 sec locally and 20 sec over the network. All the database times above are over the network.

Is there anything I can do to make the database route come anywhere close to the binary file speeds?

Maybe one of the tabular NoSQL databases is what I need?

How does that scale in terms of additional symbols + dates?

Any help from anyone who's dealt with something similar would be appreciated.
UPDATE: Thought I'd post an update to this. In the end I went with Iterator's suggestion and the data is now hosted in bigmemory memory-mapped files, with RData for quick drag-and-drop use, and the checking output written to CSV and pulled into SQL Server for backup purposes. Any database solution is too slow to be used by multiple users. Also, using RODBC against SQL Server is crazy slow; I also tried input and output to R via CSV to and from SQL, which was okay but pointless.

Also for reference, byte compiling the load method for bigmemory does seem to have an impact. Here are the results of a load test of RData vs. bigmemory:
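For reference, converting one of the matrices into a file-backed big.matrix (the .desc file is what gets attached in the benchmark code below) looks roughly like this - a sketch with hypothetical file names:

require("bigmemory")
# writes nvar_1.bin (the memory-mapped backing file) and nvar_1.desc (its descriptor)
bm <- as.big.matrix(variable_1,
                    type = "double",
                    backingfile = "nvar_1.bin",
                    backingpath = workingdirectory,
                    descriptorfile = "nvar_1.desc")
# dimnames should carry over, which is what allows subscripting by symbol and date later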
workingdirectory <- "/users/hans/92 speed test/"
require("bigmemory")
require("compiler")
require("rbenchmark")

# load the full matrix from each .RData file, then subset it
loadvariablesinfolder <- function(folder, sedols, dates) {
  filesinfolder <- dir(folder)
  filestoload <- filesinfolder[grepl(".*nvar_.*\\.rdata", filesinfolder)]
  filestoload <- paste(folder, filestoload, sep = "/")
  variablesthatwereloaded <- c()
  for (ftoload in filestoload) {
    loadedvar <- load(ftoload)
    assign(loadedvar, get(loadedvar)[sedols, dates])
    gc() -> ans
    variablesthatwereloaded <- c(variablesthatwereloaded, loadedvar)
    rm(list = c(loadedvar))
  }
  return(variablesthatwereloaded)
}
cloadvariablesinfolder <- cmpfun(loadvariablesinfolder)

# attach each file-backed matrix via its .desc file and subset without loading it all
bigmloadvariablesinfolder <- function(folder, sedols, dates) {
  workd <- getwd()
  setwd(folder)
  filesinfolder <- dir(folder)
  filestoload <- filesinfolder[grepl(".*nvar_.*\\.desc", filesinfolder)]
  variablesthatwereloaded <- c()
  for (ftoload in filestoload) {
    tempvar <- attach.big.matrix(dget(ftoload))
    loadedvar <- gsub(".*(nvar_\\d+).*", "\\1", ftoload, perl = TRUE)
    assign(loadedvar, tempvar[sedols, dates])
    variablesthatwereloaded <- c(variablesthatwereloaded, loadedvar)
    rm(list = c(loadedvar, "tempvar"))
    gc() -> ans
  }
  setwd(workd)
  return(variablesthatwereloaded)
}
cbigmloadvariablesinfolder <- cmpfun(bigmloadvariablesinfolder)

testcases <- list(
  list(numsedols = 1000,  numdates = 120),
  list(numsedols = 5000,  numdates = 120),
  list(numsedols = 50000, numdates = 120),
  list(numsedols = 1000,  numdates = 350),
  list(numsedols = 5000,  numdates = 350),
  list(numsedols = 50000, numdates = 350))

load(paste(workingdirectory, "dates.cache", sep = "/"))
load(paste(workingdirectory, "sedols.cache", sep = "/"))

for (testcase in testcases) {
  results <- benchmark(
    loadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates]),
    cloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates]),
    bigmloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates]),
    cbigmloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates]),
    columns = c("test", "replications", "elapsed", "relative"),
    order = "relative", replications = 3)
  cat("results for testcase:\n")
  print(testcase)
  print(results)
}
Basically the smaller the subset, the more you gain, because you don't spend time loading in the whole matrix. Loading the whole matrix is actually slower with bigmemory than with RData; I guess it's the conversion overhead:
# results for testcase:
# $numsedols
# [1] 1000
#
# $numdates
# [1] 120
#
#   test
# 4 cbigmloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
# 3 bigmloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
# 1 loadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
# 2 cloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
#   replications             elapsed           relative
# 4            3   6.799999999999955  1.000000000000000
# 3            3  14.389999999999986  2.116176470588247
# 1            3 235.639999999999986 34.652941176470819
# 2            3 250.590000000000032 36.851470588235543
#
# results for testcase:
# $numsedols
# [1] 5000
#
# $numdates
# [1] 120
#
#   test
# 4 cbigmloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
# 3 bigmloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
# 1 loadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
# 2 cloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
#   replications             elapsed           relative
# 4            3   7.080000000000155  1.000000000000000
# 3            3  32.730000000000018  4.622881355932105
# 1            3 249.389999999999873 35.224576271185654
# 2            3 254.909999999999854 36.004237288134789
#
# results for testcase:
# $numsedols
# [1] 50000
#
# $numdates
# [1] 120
#
#   test
# 3 bigmloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
# 4 cbigmloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
# 2 cloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
# 1 loadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
#   replications            elapsed          relative
# 3            3 146.3499999999999 1.000000000000000
# 4            3 148.1799999999998 1.012504270584215
# 2            3 238.3200000000002 1.628425008541171
# 1            3 240.4600000000000 1.643047488896482
#
# results for testcase:
# $numsedols
# [1] 1000
#
# $numdates
# [1] 350
#
#   test
# 3 bigmloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
# 4 cbigmloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
# 1 loadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
# 2 cloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
#   replications            elapsed          relative
# 3            3  83.88000000000011 1.000000000000000
# 4            3  91.71000000000004 1.093347639484977
# 1            3 235.69000000000005 2.809847401049115
# 2            3 240.79999999999973 2.870767763471619
#
# results for testcase:
# $numsedols
# [1] 5000
#
# $numdates
# [1] 350
#
#   test
# 3 bigmloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
# 4 cbigmloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
# 2 cloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
# 1 loadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
#   replications            elapsed          relative
# 3            3 135.6999999999998 1.000000000000000
# 4            3 155.8900000000003 1.148784082535008
# 2            3 233.3699999999999 1.719749447310245
# 1            3 240.5599999999995 1.772733971997051
#
# results for testcase:
# $numsedols
# [1] 50000
#
# $numdates
# [1] 350
#
#   test
# 2 cloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
# 1 loadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
# 3 bigmloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
# 4 cbigmloadvariablesinfolder(folder = workingdirectory, sedols = sedols[1:testcase$numsedols], dates = dates[1:testcase$numdates])
#   replications            elapsed          relative
# 2            3 236.5000000000000 1.000000000000000
# 1            3 237.2100000000000 1.003002114164905
# 3            3 388.2900000000000 1.641818181818182
# 4            3 393.6300000000001 1.664397463002115
I recommend using HDF5. I assume that your data is complex enough that a variety of bigmemory files (i.e. memory-mapped matrices) would not easily satisfy your needs (see note 1), but HDF5 is just short of the speed of memory-mapped files. See this longer answer to another question to understand how I compare HDF5 and .RDat files.

Most notably, the fact that HDF5 supports random access means that you should be able to get substantial speed improvements.
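To illustrate what that random access buys you, here is a rough sketch using the rhdf5 package (Bioconductor); the file and dataset names are hypothetical, and note that dimnames would have to be stored as separate datasets:

library(rhdf5)

h5createFile("variables.h5")
h5write(variable_1, file = "variables.h5", name = "variable_1")   # store the full matrix once

# read only 2000 rows and 100 columns straight from disk, without touching the rest
sub <- h5read("variables.h5", "variable_1", index = list(1:2000, 1:100))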
Another option, depending on your willingness to design your own binary format, is to use readBin and writeBin, though this doesn't have all of the nice features that HDF5 has, such as parallel I/O, version information, portability, etc.
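A rough sketch of what such a hand-rolled format could look like, assuming hypothetical file names and that you only need contiguous column ranges: the matrix is written column-major as 8-byte doubles, so seek() can jump straight to a column.

con <- file("variable_1.bin", "wb")
writeBin(as.vector(variable_1), con, size = 8)     # column-major doubles, NAs included
close(con)

read_columns <- function(path, nrows, cols) {
  con <- file(path, "rb")
  on.exit(close(con))
  out <- matrix(NA_real_, nrow = nrows, ncol = length(cols))
  for (j in seq_along(cols)) {
    seek(con, where = (cols[j] - 1) * nrows * 8)   # byte offset of column cols[j]
    out[, j] <- readBin(con, what = "double", n = nrows, size = 8)
  }
  out
}

sub <- read_columns("variable_1.bin", nrows = 45000, cols = 1:100)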
Note 1: If you have just a few data types per row, i.e. one character and the rest numeric, you can simply create two memory-mapped matrices, one of the characters and the other of the numeric values. This will allow you to use bigmemory, mwhich, bigtabulate and lots of other nice functions in the bigmemory suite. I'd give that a reasonable effort, as it's a very easy system to smoothly integrate with lots of R code: the matrix need never enter memory, just whatever subsets you happen to need, and many instances can access the same files simultaneously. What's more, it is easy to parallelize access using multicore backends for foreach(). I used to have an operation that would take about 3 minutes per .Rdat file: 2 minutes to load, 20 seconds to subselect what I needed, 10 seconds to analyze, and 30 seconds to save the results. After switching to bigmemory, I got it down to about 10 seconds to analyze and 5-15 seconds on the I/O.
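As a sketch of the parallelization point, assuming a doParallel backend and a hypothetical descriptor file nvar_1.desc: each worker attaches the same backing file and only its own subset ever enters memory.

require("bigmemory")
require("foreach")
require("doParallel")

registerDoParallel(cores = 4)
desc <- dget("nvar_1.desc")

# per-column means computed in parallel against one shared memory-mapped file
colmeans <- foreach(j = 1:346, .combine = c, .packages = "bigmemory") %dopar% {
  x <- attach.big.matrix(desc)
  mean(x[, j], na.rm = TRUE)
}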
Update 1: I overlooked the ff package - it is another option, though a lot more complex than bigmemory.