Scientific Discovery in the Era of Big Data: More than the

advertisement
Scientific Discovery in the Era of Big Data:
More than the Scientific Method
A RENCI WHITE PAPER
Vol. 3, No. 6, November 2015
ScientificDiscoveryintheEraofBigData:
MorethantheScientificMethod
Authors
CharlesP.Schmitt,DirectorofInformaticsandChiefTechnicalOfficer
StevenCox,CyberinfrastructureEngagementLead
KaramarieFecho,MedicalandScientificWriter
RayIdaszak,DirectorofCollaborativeEnvironments
HowardLander,SeniorResearchSoftwareDeveloper
ArcotRajasekar,ChiefDomainScientistforDataGridTechnologies
SidharthThakur,SeniorResearchDataSoftwareDeveloper
RenaissanceComputingInstitute
UniversityofNorthCarolinaatChapelHill
ChapelHill,NC,USA
919-445-9640
RENCIWhitePaperSeries,Vol.3,No.6 1
ATAGLANCE
•
•
•
•
•
Scientificdiscoveryhaslongbeenguidedbythescientificmethod,whichisconsidered
tobethe“goldstandard”inscience.
Theeraof“bigdata”isincreasinglydrivingtheadoptionofapproachestoscientific
discoverythateitherdonotconformtoorradicallydifferfromthescientificmethod.
Examplesincludetheexploratoryanalysisofunstructureddatasets,datamining,
computermodeling,interactivesimulationandvirtualreality,scientificworkflows,and
widespreaddigitaldisseminationandadjudicationoffindingsthroughmeansthatare
notrestrictedtotraditionalscientificpublicationandpresentation.
Whilethescientificmethodremainsanimportantapproachtoknowledgediscoveryin
science,aholisticapproachthatencompassesnewdata-drivenapproachesisneeded,
andthiswillnecessitategreaterattentiontothedevelopmentofmethodsand
infrastructuretointegrateapproaches.
Newapproachestoknowledgediscoverywillbringnewchallenges,however,including
theriskofdatadeluge,lossofhistoricalinformation,propagationof“false”knowledge,
relianceonautomationandanalysisoverinquiryandinference,andoutdatedscientific
trainingmodels.
Nonetheless,thetimeisrightforincreasedfocusontheconstructionofCollaborative
KnowledgeNetworksforScientificDiscoverydesignedtoleverageexistingdatasources
andintegratetraditionalandemergingscientificmethodsandtherebydrivescientific
discoveryandapplication.
Introduction
Knowledgediscoveryinsciencereferstothesystematicprocesswherebyscientistsdrawlogical
conclusionsregardingtheworldaroundus,generatenewtheoriesbasedonthoseconclusions,
andsharefindingswithotherscientistsandthelaypublic,thusenablingcriticalreviewand
consensusbeforenewfindingsareaddedtothecollectivebodyofknowledge.Historically,
scientificdiscoveryhasbeenguidedbythescientificmethod,whichdatestoancienttimesand
involvesbothaphilosophicalandpracticalapproachtoscience.Indeed,therenowned
philosopherAristotle(384-322BC)wasoneofthefirsttoapproachknowledgediscovery
throughrigorous,systematicobservation,althoughitwasn’tuntilmillennialaterthatthe
scientificmethodwasactuallyformalizedandimplemented,largelythroughtheworkof
Copernicus(1473-1543),TychoBrahe(1546-1601),JohannesKepler(1571-1630),GalileoGalilei
(1564-1642),ReneDescartes(1596-1765),andIsaacNewton(1643-1727)(Gower1997;Betz
2011).
Atfirstglance,thescientificmethodappearstoberelativelysimpleandstraightforward.In
sum,themethodinvolvesarepeatingcycleofstandardizedsteps:theprocessbeginswith
carefulobservationofthenaturalworld,theframingofaquestiononthebasisofone’s
observations,andareviewoftheexistingbodyofknowledgetodetermineifareasonable
RENCIWhitePaperSeries,Vol.3,No.6 2
explanationalreadyexists.Assumingthatthequestionremainsopenended,ascientistwill
thenformulateahypothesis,designandimplementanexperimenttotestthehypothesis,and
analyzetheexperimentalresults.Theanalyticalresultsareusedtoformallyacceptorrejectthe
hypothesis.Thehypothesismaythenbemodified,withtheexperimentrepeatedoranew
experimentdesigned.Importantly,theresultsarethendisseminatedviapresentationtopeers
andpublication,whichallowforadjudication1orpeerconsensusregardingthevalidityofthe
scientificfindings.This,inshort,isscientificdiscoveryviathescientificmethod.
RevisitingCharlesDarwin
Asanyschoolchildwillattest,theworkofCharlesDarwinprovidesanexemplarofthescientificmethodandthe
manychallengesinvolvedinscientificdiscovery(Burkhardt1996;McKie2008;Montgomery2009).Darwin’sgenius
perhapsliesinhiskeenabilitytoobservethenaturalworld.Oneofhisearliestobservationswasthatsimilar
speciesarefoundacrosstheglobeandthatindividualswithinaspeciesarenotidenticalbuthavelocalvariation.
Hequestionedwhymultiplespeciesexistinsteadofjustone.Darwincontinuouslyresearchedandreviewedthe
existingscientificliterature(i.e.,theestablishedscientificbodyofknowledge).
Darwin’sworkwasinfluencedbytheresearchandwritingsofThomasMalthus,whofoundthathumansproduce
moreoffspringthanareneededtoreplacethemselvesandspeculatedthatpopulationsizewouldsoonexceedthe
availableresourcesrequiredforsurvival.Darwinalsoobservedthatpopulationsofplantsandanimalsstayabout
thesamesizebecauseoflimitedresourcesandcompetitionforthoseresources.Darwin’sthinkingalsowas
influencedbytheresearchandwritingsofCharlesLyell,whofoundthatsmall,gradualgeologicalprocessescan
producelargechangesovertime.Darwinmadeseveralbrilliantinferencesonthebasisofhisscientific
observationsandtheworkofMalthus,Lyell,andothersthatledhimtohypothesizethatspecieschangeslowlyin
aprocessofevolutionfromacommonancestor.Hethenspentdecadesinobservationalexperimentation,
ongoinganalysisofhisscientificfindings,andrefinementofhishypothesisuntilheeventuallyreachedthenow
famousconclusionthattheoriginofthespeciesliesinnaturalselection,ortheprocesswherebyindividual
variation,coupledwithcompetitionamongindividualsfornaturalresourcesandsocialcooperationamongkinto
increaseindividualfitness,determinesdifferencesamongrelatedspecies.
Coincidentally,whileDarwinwasrefininghishypothesisanddrawingaconclusion,AlfredR.Wallace,amuchjunior
researcherwhowasfamiliarwithDarwin’swork,reachedasimilarconclusion,whichheplannedtopublishand
2
disseminatetoscientificpeers.Awarethathisworkmightgounrecognized, Darwinreachedanagreementwith
Wallace,brokeredbyLyellandJosephD.Hooker,andreluctantlypublishedajointscientificmanuscriptin1858.
OntheOriginofSpeciesbyNaturalSelectionwasn’tpublisheduntilNovember1859,butonlyasabookchapter,
notthefullbookthatDarwinhadintended.Interestingly,thescientificcommunityreactednegativelytoDarwin’s
work(e.g.,Gray1860);manyofDarwin’speersfeltthathedidnothavesufficientevidencetoputforthhis
hypothesis,whileothersweredisturbedbythetheological,political,andsocialimplications.Thescientificdebate
continuedfornearlyacenturyuntilthescientificcommunityreachedadjudicationonDarwin’shypothesisand
scientificfindingsandestablishedthetheoryofevolutionbynaturalselection.Ofnote,thelaydebatecontinues
today,largelyontheologicalandpoliticalgrounds.
1
“Adjudication”isalegaltermthatreferstodecisionmakinginthepresenceofaneutralthirdpartywhohasthe
authoritytodetermineabindingresolutionthroughsomeformofjudgmentoraward.Inscience,adjudication
generallyreferstodecisionmakingandconsensusbuildingamongagroupofwidelyregardedscientificexpertson
thetopicunderdiscussion(Spangler2003).
2
“Ineversawamorestrikingcoincidence,ifWallacehadmyM.S.sketchwrittenoutin1842hecouldnothave
madeabettershortabstract!EvenhistermsnowstandasHeadsofmyChapters…Soallmyoriginality,whateverit
mayamountto,willbesmashed.ThoughmyBook,ifitwilleverhaveanyvalue,willnotbedeteriorated;asallthe
labourconsistsintheapplicationofthetheory”—CharlesDarwintoCharlesLyell,June18,1858(InCharles
Darwin’sLetters.ASelection,F.Burkhardt(Ed.),1996).
RENCIWhitePaperSeries,Vol.3,No.6 3
Thescientificmethodhascertaindesirablecharacteristicsthathaveenabledittowithstandthe
passageoftime,evenasnewscientifictoolsandtechniqueshavebeenintroducedtothe
scientificprocess.Forexample,themethodisobjectiveandremovedfromallpersonaland
culturalbiases.Allhypothesesaredevelopedtobeconsistentwithacceptedscientifictruthsat
thetimethatthehypothesisisgenerated.Allmeasurementsmustbeobservableandpertinent
tothehypothesis.ThehypothesisandallconclusionsmustfollowtheprincipleofOccam’s
Razorintermsofparsimony,orsimplicitywithfewassumptions.Thehypothesisalsomustbe
falsifiable(andcapableofbeingdisproven).Finally,thescientificfindingsmustbereproducible
byotherscientists.
Withoutdispute,thescientificmethodremainsthemostcommonandonlyvalidatedapproach
toscientificdiscoveryandknowledgeextraction.Infact,drugdevelopmentintheU.S.relies
exclusivelyontherandomizedplacebo-controlledclinicaltrial—theexemplarofthescientific
method.
However,today’sadvancementsindigitalcomputingandstoragecapabilities,coupledwith
newmethodsforscientificcommunication,includingsocialmedia,areintroducingnew
approachestoscientificdiscovery,eachofwhichbringschallengesandopportunities.Inthis
whitepaper,weconsiderhowtheadventofthedigitalageandtoday’sworldof“bigdata”are
changingscientificdiscoveryprocesses.WeclosewithaframeworkforCollaborative
KnowledgeNetworksforScientificDiscoverydesignedtoleverageexistingdatasourcesand
integratetraditionalandnewdata-drivenscientificmethods,allowingforunprecedented
advancesinscientificdiscoveryandapplication.
Exploratorydatasetsandexploratoryanalysis
Today’sscientisthasaccesstonumerouslargedatasetsofrelevancetomultiplescientific
domains.Forexample,theNationalOceanicandAtmosphericAdministrationmaintainsseveral
databasescontainingdataonclimatepatterns,earthquakes,ozonelevels,andocean
temperatures;thesedataareusefultoscientistsinmanyfields,includingenvironmental
science,energy,publichealth,andmedicine.Scientistsareincreasinglyaccessingthesedata
setsforexploratoryanalysis,whichisanapproachusedtodetermineifageneralhypothesis
bearsanymeritand/orifanexperimentaldesignisfeasible.Exploratoryanalysistypically
beginswithageneralhypothesisorexperimentaldesignthatisn’twellfleshedoutandthe
applicationofavarietyofstatisticalapproachesandvisualizationtechniquestoidentifyand
validatedataelementsand/ordetermineifahypothesisistestableusingthedata(Behrens
1997;Gelman2004;Diaconis2011).
Forexample,consideraneconomicsresearcherwhoisinterestedinlearningabouttheloaning
practicesofalargecreditunion,intermsofthebreakdownofmortgageloansacross
socioeconomicsectors.S/herequestsaccesstothecreditunion’sloandatabaseinorderto
determinehowthedataarestructuredandhowfine-grainedtheavailablesocioeconomicdata
elementsare.Theresearcherdeterminesthatthedatabasecontainsdataelementsontheage,
sex,andgrossincomeofloanholders,aswellastheloanamountandthelocationofthe
property,butitdoesnotcontaininformationontheethnicityoftheloanholders.The
RENCIWhitePaperSeries,Vol.3,No.6 4
researcherusesthisinformationtomodifyhis/herhypothesisforsubsequenttestingusingthe
samedata.
Exploratoryanalysishasalwaysbeenpartofscientificdiscoveryand,historically,hasbeenused
togeneratethedrivingquestionunderlyingascientifichypothesis.However,inthepast,the
processwasinformal,slow,andobservation-based(e.g.,detailednotesonthetypesoffoliage
identifiedatdifferentaltitudeswithinagivenregion);whereastoday,itisfast,large-scale,
data-drivenandofteninvolvesextensiveuseofadvancedstatisticalmethodsandvisualization
techniques.Furthermore,largedatasetsarebeinggeneratedstrictlyforexploratoryanalysis,
andoftensuchdatasetsincuraconsiderableexpensewithquestionablecost-benefit.3Arecent
McKinseyreport(Manyika,etal.2015),forexample,estimatesthatlessthan1%ofdata
capturedandstoredfromanoffshoreoilrigequippedwith30,000sensorsisactuallyusedto
guideoperations.Thevalueoftherestofthedataremainstobedetermined.
Today,exploratoryanalysisisfast,large-scale,data-drivenandofteninvolves
extensiveuseofadvancedstatisticalmethodsandvisualizationtechniques.
Intermsofbenefits,exploratoryanalysisenablesascientisttoquicklydevelopamorerefined,
testablehypothesisandrigorousexperimentaldesignforsubsequenthypothesistesting
(Behrens1997;Gelman2004;Diaconis2011).Challengesincludethecostsinvolvedwiththe
generationandlong-termstorageofexploratorydatasets.Inaddition,drawingconclusions
fromananalysisofexploratorydatasetscanintroduceerrorifthedataelementsarenot
describedwithsufficientmetadatatofullyunderstanddatastructureandmeaningorifthe
dataelementshaveinherentbiasesduetohowtheywerecollected.Additionalchallenges
includeissuesofprivacyandtheinadvertentorintentionalleakageof“sensitive”data,
particularlyifascientistdoesnotobtaintherequisiteauthorizationtoaccessadatasetthat
containssensitivedataorifsensitivedataareaccidentallyprovidedtoanunauthorizedorthirdpartyuser(Behrens1997).
Datamining
Dataminingoftenbeginswithoutahypothesisandinvolvestheapplicationoftoolsand
techniquesfromstatistics,mathematics,andvisualizationtoidentifypreviouslyunknown
patternsandtrendsinadatasetderivedfromoneormoreexistingdatabases(Fayyad,etal.
1996;Holzinger,etal.2014).Whiledataminingissometimesconsideredaformofexploratory
dataanalysis(Behrens1997;Diaconis2011),weargueforadistinctionbetweenexploratory
dataanalysis,whichenablesascientisttoexploreadatasetintermsofstructureandtypesof
dataelements,includingtherelevancyofexternaldatasources(e.g.,literaturecitations)to
3
Thecostsandbenefitsoftheseexploratorydatasetsareanopentopicofdiscussion;forexample,see
Wilhelmsen,etal.2013andEvans,etal.2015fordiscussionsontheprosandconsofcapturingandstoringwhole
genomeversustargetedgenomesequencingdata.
RENCIWhitePaperSeries,Vol.3,No.6 5
dataelements,anddatamining,whichenablesascientisttoidentifypreviouslyunknown
patternsinadatasetintermsofrelationshipsbetweendataelements.Wenotethatnew
statisticalapproachesarebeingdevelopedtoovercomesomeofthelimitationsofbothtypes
ofscientificdiscovery.Forexample,themaximalinformationcoefficient,orMIC,overcomes
thelimitationsofPearson’scorrelationcoefficient,r,byallowingfortheidentificationof
complex,nonlinearrelationships(e.g.,exponential,periodic)betweendataelementsindata
setsofanysize(Reshef,etal.2011).4Supposeamicrobiologistgeneratesageneexpression
datasettoidentifygenesinvolvedintheregulationofthecellcycle.Usingtraditionalstatistical
approaches,thescientistisabletoidentifygeneswithstrongdeterministicorlinear
associationswithdifferentaspectsofthecellcycle(e.g.,agenethatisrequiredforchromatin
assembly).UsingapproachessuchastheMIC,thescientistisabletoidentifygeneswith
periodicrelationshipswiththecellcycle(e.g.,ageneassociatedwithanestablishedcyclical
eventoraneventthatoccursatapreviouslyunknownfrequencyduringthecellcycle).
Dataminingbearslittleresemblancetothescientificmethod.Inparticular,dataminingisnot:
(1)constrainedtobeobjective;(2)consistentwithacceptedscientifictruths;(3)parsimonious;
or(4)falsifiable.Inaddition,allmeasurementsareobservableonlyafterthefact.Nonetheless,
dataminingoffersbenefits,includingtheabilitytorelativelyquicklydiscoverpreviously
unknownrelationshipsandtherebygenerateanewhypothesisthatcanthenbetestedusing
thescientificmethod(Fayyad,etal.1996;Holzinger,etal.2014).Althoughoftencitedasa
criticismofdatamining,akeybenefitistheabilitytogeneratemultipleassociationswithina
singledataset,whichmayyieldpowerfulnewinformation,especiallywhencombinedwith
associationsidentifiedinotherdatasets.Inthisregard,dataminingcanbeviewedasakinto
meta-analysisofclinicaltrialdata.
Akeybenefitofdataminingistheabilitytogeneratemultipleassociationswithin
asingledataset,whichmayyieldpowerfulnewinformation,especiallywhen
combinedwithassociationsidentifiedinotherdatasets.
Challengesincludethefactthatdataminingrequiresextensivetrainingbeyondtheskillsofa
typicaldomainscientist;withoutsuchtraining,ascientistmayidentifypatternsorassociations
thatarenotvalidorreproducible(Fayyad,etal.1996;Behrens1997;Diaconis2011;Holzinger,
etal.2014).Further,thereisnocommonlyacceptedapproachormethodfordatamining,
whichmakesthefieldsomewhatmoreofanartthanascience(Fayyad,etal.1996;Diaconis
2011).Thereisalsoatendencytotreatallfindingsasconclusivewhentheymaybechance
4
Asanaside,wenotethatthetheoreticalfoundationoftheMICliesintheconceptofmutualinformation(MI)in
pairsofrandomvariables,whichwasdevelopedbyClaudeShannon,thefounderofinformationtheory,morethan
50yearsago(Speed2011).TheimplementationofMIintopracticeasMICcannotbeachievedthroughmanualor
semi-manualcomputationandinsteadrequiressignificantdigitalcomputationalpower—somethingnotavailable
untilrecently.
RENCIWhitePaperSeries,Vol.3,No.6 6
findings(Fayyad,etal.1996;Diaconis2011).Inaddition,whilethepowerofdatamining
increaseswhenmultipleheterogeneousdatasetsareintegratedbeforethedataaremined,so
toodoesthecomplexityoftheprocess(Fayyad,etal.1996;Holzinger,etal.2014).Withhighdimensionaldata,multipleapproachesmustbeusedtoanalyzethedata;forexample,
sophisticatedvisualizationapproachesmayneedtobecombinedwithtraditionalstatistical
approachestodatamining(Fayyad,etal.1996;Diaconis2011;Holzinger,etal.2014).
Computermodeling
Computermodelinginvolvesconceptual,mathematical,computer-generated,orphysical
representationsofreal-worldobjectsorphenomenon(Bowers2012;Buytaert,etal.2012;
Perraetal.,2012;Berman,etal.2015).Itisusedtotestahypothesisorobserveand
manipulateanobjectorphenomenonthatisotherwisedifficult(orunethical)toobserveand
manipulate.
Asanexample,considerascientistwhowishestodeterminehowthebeta-amyloidprotein
influencesthedevelopmentofAlzheimersDisease.S/hedevelopsasoftwareprogramusing
establishedprinciplesofproteinfoldingtovisuallyexplorein3-Ddifferentscenarioswhereby
beta-amyloidmayfoldinthebraintoinfluencecognitivefunctionandleadtothedevelopment
ofAlzheimersDisease.
Modelingrepresentsafundamentalaspectofscientificdiscoveryandhasbeenused
throughoutthehistoryofmodernscience(e.g.,anatomicalmodels).Theuseofcomputer
modelingisrelativelynew,however,andassuch,computer-drivenmodelingtendstobea
highlyspecializedtoolasopposedtoageneraltool.Benefitsofcomputermodelingincludethe
additionofapotentiallypowerfultooltotheformerlymanualprocess,particularlywhen
modelsincorporatethevaststreamsofdatathatareavailablefromtechnologiessuchas
crystallographyandothersophisticatedsourcesofdata(Perra,etal.2012;Berman,etal.2015).
Computermodelsbecomeevenmorepowerfulwhentheyaregeneratedusingdataderived
frommultidisciplinaryscience(Bowers,etal.2012;Buytaert,etal.2012).
Challenges,however,includethefactthatcomputer-generatedmodelsareonlyasgoodasthe
underlyingdataandsoftwareprogramsusedtocreatethem(Joppa,etal.2013;Berman,etal.
2015).Inaddition,theriskofintroducingerrorincreasessignificantlywhenmodelsarecreated
forpoorlyunderstoodobjectsorphenomenon,whenthedataarequalitativeorotherwise
describedinanon-standardizedformat,orwhenmodelsdevelopedinonefieldareapplied
(withoutvalidation)toanotherfieldorevensharedbetweendifferentresearchgroupswithin
thesamefield(Bowers,etal.2012;Buytaert,etal.2012;Perra,etal.2012;Joppa,etal.2013;
Bergman,etal.2015).Onceintroduced,errorstendtopropagateandmayevenamplify
(Buytaert,etal.2012).Moreoever,manymodelsarenotdesignedtoworkinadynamic,
flexible,user-driven,webenvironment(Buytaert,etal.2012).Finally,modelingcostscanbe
quitehigh,sometimeshigherthanthecostsoftheinstrumentsusedtogeneratethedatathat
areusedinthemodels(Berman,etal.2015).
RENCIWhitePaperSeries,Vol.3,No.6 7
Interactivesimulationandvirtualreality
Interactivesimulationandvirtualrealityaresimilartomodelinginthatacomputerisusedto
createrealisticscenariosfortestingahypothesisormanipulatinganobjectorphenomenon
thatisotherwisedifficult(orunethical)totestormanipulate.However,thedifference,as
definedherein,isthatwithinteractivesimulationandvirtualreality,humanbehavior,notthe
underlyingsoftwareprogram,guidestheoutcomeofthesimulationorvirtualexperience(Zyda
2005;Kunkler2006;Diemer,etal.2015).
Supposeamedicaldevicecompanyhasbeendevelopinganewanesthesiamachine.Inorderto
obtainapprovalfromtheU.S.FoodandDrugAdministration,thedevicehastobetestedfor
safetyinPhaseIandIIclinicaltrials.Toachievethis,thecompanycreatesa“dummy”patient
thatisprogrammedtomimicphysiologicalresponsesduringparticulartypesofsurgeriesina
simulatedoperatingroom.APhaseIclinicaltrialisthenconductedinthesimulated
environment,usingateamofactualsurgeons,anesthesiologists,andnursesasstudysubjects.
Interactivesimulationandvirtualrealitycanbeusedforhypothesistestingaspartofthe
scientificmethod,buttheyalsocanbeusedforexploratoryanalysis.Thebenefits
includetheabilitytoinvestigatehumanbehaviorinthecontextoftechnologyorcomputerassistedscenariosthatresembletherealworld(Zyda2005;Kunkler2006;Diemer,etal.2015).
Thus,theapproachfacilitatesbothteam-basedscienceandscienceonhumanteams.Several
challengesexist,however.First,ateamwithexpertiseinbothsoftwaredevelopmentand
humanbehaviormustcreatethesoftwareprogramsthatdrivethesimulationsandvirtual
scenarios,withcarefulattentiontoeverydetailofthescenario(Diemer,etal.2015).Also,
humanbehaviorinasimulatedenvironmentmightnotbeidenticaltohumanbehaviorinthe
realworld;atpresent,softwareprogramscannotfullycapturethetruehumanexperience,
includingbehaviorssuchasemotionalreactions(Kunkler2006).Finally,technological
challengesgrowexponentiallywiththecomplexityofthescenario,andupfrontcostsarehigh
(Zyda2005).
Scientificworkflows
Scientificworkflowsrefertotheabstractstepsrequiredtocompleteaspecificscientifictask.
Today,thesearetypicallydesignedasspecializedworkflowmanagementsystemsthatare
capableoforchestratingandautomatingmanyofthesteps,includingdataflowand,insome
cases,personnel(e.g.,scheduling,accessrights,alertsorreminders),requiredtocompletea
specifictaskand/ortestahypothesis(es)(Curcin&Ghanem2008;Perraud,etal.2010;
Achilleos,etal.2012;Bowers2012;Guo2013).Theconceptoftheautomatedscientific
workflowappearstohaveoriginatedwithanundatedpublicationbySinghandVouk(see
references).Scientificworkflowsareusedtoautomatetedious,time-consuming,orhighly
complexstepsinexperimentaltestingand/oranalysis.
Forinstance,imaginethatabiomedicalresearchcompanyreliesonflowcytometry,a
techniquethatenablesthevisualizationandquantificationoftheproteincompositionoflive
cells,asacriticalpartofitsdrugdiscoveryeffortsindiabetes.Thecompanydecidesto
RENCIWhitePaperSeries,Vol.3,No.6 8
automatemuchoftheprocessandhiresasoftwareengineeringteamtodesignascientific
workflowsystemtoautomateandcoordinatethetechnologiesandresearchteamsinvolvedin
eachstepoftheprocess,includingisolationofbloodcellsfrompatientswithdiabetes,
incubationofcellswithconjugatedantibody(ies),multiplewashes,immunofluorescence
visualization,andanalysistodeterminecellularsubpopulations.
Thedesignofmostscientificworkflowsystemsmodelstheactualscientificmethod,exceptthat
themeasurementsareautomatedandnotalwaysobservablebythescientist.Moreover,
scientificworkflowsareincreasinglybeingusedforautomateddeductionandinference,which
arestepsthathistoricallyreliedonthescientist.Intheexampleabove,forinstance,theuseof
flowcytometrytoidentifycellularsubpopulationsonthebasisoftheirfluorescentprofilecan
bemoreofanartthanascience,especiallywhenmultiplefluorescentconjugatesareused;as
such,automatingtheprocesscanyielderroneousorinconsistentresults.Thatsaid,scientific
workflowsenableascientisttoconductexperimentsmorequickly,efficiently,andwithgreater
statisticalpowerandreproducibilityduetothelargedatavolumesthatawell-designed
scientificworkflowcanhandleandtheabilitytointegrateheterogeneousdatasources,oftenin
realtime(Perraud,etal.2010;Achilleos,etal.2012;Bowers2012;Guo2013).
Scientificworkflowsareincreasinglybeingusedforautomateddeductionand
inference,whicharestepsthathistoricallyreliedonthescientist.
Whilepotentiallyquitepowerful,thescientistintherealmoftheworkflowisoftendependent
onthetechnologyandremovedfromcriticaldecision-makingsteps(e.g.,calculations,
incubationtimes,etc.),which,intheabsenceofcarefulsystemdesignandimplementation,
threatensthevalidityandreproducibilityofworkflowfindings.Furthermore,unlessthe
workflowstepsareclearlyannotatedandopenlyaccessible,apeerreviewerwillbeunableto
fullyevaluateandreproducethescientificfindings(Bowers2012).Moreover,workflowdesign
canbequitecomplex,involvinghundredsofindividualanalysissteps,largesetsof
heterogeneousdata,andmultipleexistingworkflowsthatoftenwerenotdesignedtowork
together;thecomplexitypresentsdifficultiesinuser-friendliness,design,andcaptureand
recordofprovenance5(Perraud,etal.2010;Achilleos,etal.2012;Bowers2012;Gup2013).
Thereisalsoapracticallimittothedegreeofgranularityinworkflowtasks;highlygranular
activitiesaretypicallynotfeasibleduetoanegativeimpactonoverallsystemperformance
(Perraud,etal.2010).Anadditionalconcernisthatscientificworkflowscanintroduce
systematicbiasinthaterrors(e.g.,anincorrectcalculation)maybeintroducedandpropagated
indefinitelybecauseautomationmakesitmoredifficulttodetectandcorrecterrors.Finally,
workflowstendtobeveryspecializedandoftencannotberealisticallyadaptedfordifferent
domains(Curcin&Ghanem2008).
5
“Provenance”referstothecaptureofmetadatathatdescribestheoriginsofthedata,eachstepofdata
transformationandanalysis,andarecordofversioningtoidentifyhowup-to-datethedataare(Guo2013).
RENCIWhitePaperSeries,Vol.3,No.6 9
Disseminationandadjudication
Disseminationandadjudicationofscientificdiscoveriesarecriticalcomponentsofscientific
discovery,asthesearetheprocessesthroughwhichnewscientific“truths”areaddedtothe
collectivescientificbodyofknowledge(Smith2006;ACSpublications2013;Almeida2013;Cold
SpringHarborLaboratory2014).Disseminationhashistoricallybeenachievedthrough
presentationtoscientificpeersandpublicationinscientificjournalsinordertoobtaincritical
peerreviewandvalidation(orrejection)ofscientificfindingsandinformthescientificandlay
communities.Adjudicationisthemethodbywhichaconsensusisreachedamongscientific
expertsregardingthevalidityofthescientificfindings(alsoseefootnote1).
Asanexampleofthedisseminationandadjudicationprocesses,assumeascientistdeliversa
PowerPointpresentationonnewtechnologiesforhydrologyatascientificconferenceonwater
safety.S/hereceivesimmediatefeedbackonthepresentationfromcolleaguesinattendanceat
thelecture.Onthebasisofthatfeedback,thescientistdecidesthatanadditionalexperimentis
requiredbeforehis/herworkisreadyforpublicationinapeer-reviewedjournal.Toprovide
anotherexample,imagineascientistintendstopublishhis/hernewceramicsmodelina
scientificjournalfocusedonmaterialsscience.Thejournalrequirescriticalpeerreview.6The
scientistsubmitsthemanuscripttothejournalforpotentialpublication,anditisreviewedby
anonymouspeers.Onthebasisofthepeerreview,theeditordecidesthatbeforethe
manuscriptcanbepublished,thetextneedstoberevisedtobetterframetheneedforanew
ceramicsmodelinthecontextofexisting,similarmodels.
Historically,disseminationandadjudicationhavebeenkeycomponentsofthescientificmethod
andthefinalscreenbeforenewscientifictruthsareaddedtothescientificbodyofknowledge.
Thebenefitsaretremendousbecausetheprocessenablesscientificfindingstoberigorously
evaluatedbyexpertscientists,thusensuringtherobustnessandintegrityofscientificfindings
(Smith2006;ACSpublications2013;Almeida2013;ColdSpringHarborLaboratory2014).The
challenges,however,aregrowingintheeraofbigdata.Specifically,thedigitalworldis
changingthewaythatscientificfindingsaregenerated,presented,published,andcatalogued.
Forexample,traditional,paper-based,peer-reviewedscientificjournalsarecompetingwith
new,electronic,openaccessscientificjournals7thatmaynotrequireaslow,rigorouspeer
review,butonlyaquick,minimaleditorialreviewbeforepublication(Almeida2013;ColdSpring
6
“Peerreview”referstotheprocesswherebyscientificjournalsorscientificfundingorganizationsenlist(forfree)
thehelpofcolleagueswithinthesamefieldasthemanuscriptorgrantproposal’sinvestigativeteamtoevaluate
thescientificmeritandintegrityofadocument(Smith2006).
7
“Openaccess”scientificjournalsareonlinejournalsthatpermitaccesstoalljournalcontentwithouta
subscriptionfee.(ColdSpringHarborLaboratory2014).Thepremiseistoallowordinarycitizenstoaccessall
scientificfindings,particularlythosethataregeneratedwithfederalmoney,asa“publicgood”(e.g.,
http://publicaccess.nih.gov/),muchthesamewaythatfederalsalariesarereleasedtothepublicforevaluation.
“Publicgoods”,asdefinedbyeconomists,arethoseitems(commoditiesorservices)thatarebothnon-excludable
andnon-rivalrous;examplesincludepublicparks,sewersystems,highways,policeservices,etc.(Samuelson1954).
Ironically,manyjournalsrequiretheauthortopayafeeforopenaccesspublication. RENCIWhitePaperSeries,Vol.3,No.6 10
HarborLaboratory2014).Inaddition,manyscientistsarenowpublishingnon–peer-reviewed
scientificfindingsinelectronicform,viawebsites,digitalwhitepapersandtechnicalreports,
blogs,webcasts,YouTubevideos,etc.(ACSpublications2013;Almeida2013).Thisapproach
allowsforrapiddisseminationofscientificfindings,butitbypassesboththepeerreview
processforpublicationandthepeerselectionprocessforspeakerpresentationatascientific
meeting.Thatsaid,thecurrentpeerreviewprocessisnotwithoutflaw,intermsofbiasand
error(Smith2006),soperhapstheshifttodigitalself-publicationwillreformthepeerreview
processorkindleanentirelynewformofpeerreview.NewhybridmodelssuchaseGEMs
combinepeerreviewwithfreepublicationandopenaccesspermissions;whetherthesemodels
arefiscallysustainablehasyettobeseen.Anotherchallengeisthewealthofscientific
informationavailabletoday.Thisdisseminationdelugemakesitchallengingtoadjudicate
existingfindingsandidentifyimportantnewfindingsthatmaybecomelostinthewealthof
availabledata(i.e.,the“longtailofscience”)(Trader2012).
Knowledgebases
Traditionally,thedisseminationandadjudicationofscientificknowledgeinvolvesocial
processes;howeverthestorage,integration,search,andinferenceofknowledgeisrapidly
changingtoahybridmodelthatreliesequallyuponhumansandlarge,complex,collectionsof
digitalinformationthatincludedata,relationshipsbetweendataelements,humanassertions
onthedata,and,insomeinstances,capabilitiesforautomatedcognitiveprocessing:
knowledgebases.Theterm“knowledgebase”wasintroducedinthe1970s(Jarke,etal.1978)to
distinguishitfromtheexistingdatabaseofthetime,whichessentiallystoreddataintabular
formforuseraccessviaquery.Thetermremainslooselydefined,butaknowledgebasecanbe
consideredtobeacomputingsysteminwhichassertionsarerepresentedandpersistedwithin
anoverarchingontologicalframeworkthatprovidesthesemanticcontextandthatallowsfor
queryingandcomputingacrossassertions.Thehumanuserisintegraltotheknowledgederived
fromandcontainedwithintheknowledgebase;and,insomecases,thesystemcanbe
structuredtoderivenew,automatedassertionsbasedonmachinelearningornewdata.While
traditionaldatabaseshaveevolvedtobecometransaction-basedandrelationalandcanbepart
ofaknowledgebasesystem,knowledgebasesremaindifferentiatedinthatthehumanuser’s
abilitytodynamicallyaccess,addto,andusetheknowledgeisanessentialpartofthedesign
andfunctionofthesystem(e.g.,Wikipedia)(Pugh&Prusak2013).
Completelyautomatedknowledgebasesareunderdevelopment,butautomatedcurationof
humanassertionsisnotyetpossibleandmayneverbe.Nonetheless,emerging
knowledgebasesareincorporatingsophisticatedalgorithmsforautomatedandsemiautomatedstructuring,parsing,summarization,retrieval,andvisualizationofinformation.
Leading-edgeproductionknowledgebasescurrentlycontain109to1011assertionsminedfroma
varietyofunstructuredandsemi-structureddatasources,includingWikipediaandPubMed
(Bordes,etal.2015;Southern2015).Prominentcommercialexamplesofknowledgebases
includetheIBMWatsonsystemandElsevierPathwayStudio;proprietaryexamplesinclude
WalMart’sSocialGenomeandGoogle’sKnowledgeGraph;opensourcesystemsinclude
OpenBEL,OpenPHACTS,andDARPA’sDeepDive.OtherexamplesincludetheSemanticWeb
andtheLinkedDatamovement.
RENCIWhitePaperSeries,Vol.3,No.6 11
Theuseofknowledgebasesinscientificresearchisgrowing.Forinstance,theGenBank
knowledgebaseallowsonetosearchforspecificgenesandthenidentifyawealthofcurated
informationaboutthatgeneandrelatedgenesandexternaldatasources.TheGeneOntology
(GO)knowledgebaseprovidesannotationongenes,biologicalprocesses,cellularcomponents,
andmolecularfunctionsandhasbeenprominentintheapplicationofknowledgebasesfornew
statisticalapproachessuchasenrichmentanalysis(Mi,etal.2013).Asscientistsbecomemore
awareofthebenefitsofknowledgebases,andasthequalityandquantityoftheirassertions
increase,theirapplicationinscientificdiscoveryisexpectedtogrow.
Emergingknowledgebasesareincorporatingsophisticated
algorithmsforautomatedandsemi-automatedstructuring,parsing,
summarization,retrieval,andvisualizationofinformation.
Majorchallengesinthedevelopmentofknowledgebasesforscientificdiscoveryincludethe
costsofhumancuration,intermsoftheamountoftimeandexperiencerequiredtocontribute
toaknowledgebaseinameaningfulway,andthelackofincentivesforscientiststocontribute
toknowledgebases,especiallythoseperceivedtobeoutsideofascientist’sareaofexpertise.
Asexemplifiedintheknowledgebasesystemslistedabove,recentresearchhasadvancedour
understandingofhowtoconstructknowledgebases,includingtheincorporationofprobabilistic
models,structuredrepresentationofunstructureddata,deep-learningsystems,questionanswerschemas,andnaturallanguageprocessingalgorithms.Recentresearchalsohas
advancedourunderstandingofhowtointegrateknowledgefromdifferentsourceswith
differingqualityanddifferingsemantics(Southern,2015;Nickel2015).Forexample,DeepDive
isbuiltuponaprobabilisticgraphmodelthatallowsfortheencodingofassertionsand
relationshipsbetweendataelementsthathaveinaccuracies,suchasthosethatarederived
frompredictivemodelsanddata-drivenapproaches,alongwithassertionsthathavehigher
validity.Thesystemusesinferenceenginestominetherelationships,aswellasthestrengthof
theassertions,inordertoconstructdifferent“worldmodels”uponwhichnewinferencescan
bemade.Theseadvancesshowpromise,butthehumanuser,thescientist,remainsthe
roadblockinthefurtherdevelopmentandwidespreadapplicationofknowledgebasesfor
scientificdiscovery.
TheBigPicture:TheFutureofKnowledgeDiscoveryinScience
Werecognizethattherehavealwaysbeenscientificapproachesthatdonotrelyonthe
scientificmethod,asothershavearguedpreviously(e.g.,Cleland2001).Yet,thescientific
methodremainsthe“goldstandard”intermsofscientificdiscovery.Thewealthofdata
availabletodayofferstheunprecedentedopportunitytoconductscienceinwaysneverbefore
envisioned:real-timescience,real-worlddata,andnewmethodsofscientificdiscovery.In
embracingnewapproachestoscientificdiscoverythatarenotnecessarilyalignedwiththe
RENCIWhitePaperSeries,Vol.3,No.6 12
scientificmethod,ormaybeevencontradictit,wemustconsiderhowtobestemployandunify
thenewapproacheswitheachotherandwiththetraditionalscientificmethod.
JohnTukey,consideredbymanytobethemostinfluentialstatisticianinmoderntimes,once
stated:“Thebestpartofbeingastatisticianisthatyougettoplayineveryone’sbackyard.”
AmongTukey’smanyaccomplishmentsaretheintroductionoftheterms“bit”(in1946)and
“software”(in1958)andtheconceptsandmethodsofexploratorydataanalysis,datamining,
theTukeyFastFourierTransformation,andavarietyofotherstatisticalandmathematical
approachesforscientificdiscovery,manyofthemalsonamedafterhim(Bittrich2000;
Leonhardt2000;Brillinger2002).Tukeyalsointroducedtheterm“uncomfortablescience”,
whichhasbeendescribedaslyingsomewherebetweenclassicalmathematicalstatisticsand
“magicalthinking”(Diaconis2011).Theconceptisthatinmanyinstances,scientificinference
canandmustbemadefromintuitionandexploration,ratherthancontrolledexperimentation,
usingafiniteamountofpotentiallyrichdatathatareflawedandoftennonreplicable.8Tukey
diedin2000,beforetheintroductionoftheterms“bigdata”and“InternetofThings”,butone
canbetthathe’dbethefirsttotakeadvantageofthemanyopportunitiesthatthe
“datafication”ofsocietyhasprovided(Bertolucci2013).
Thewealthofdataavailabletodayofferstheunprecedentedopportunityto
conductscienceinwaysneverbeforeenvisioned:real-timescience,real-world
data,andnewmethodsofscientificdiscovery.
Anotherforward-thinkingscholaristhelateJames(Jim)Gray,whoselastpositionwasasa
TechnicalFellowandManageroftheBayAreaResearchCenteratMicrosoft.Inadditiontohis
pioneeringworkonlargedatabasesandtransactionprocessingsystems,Grayintroducedthe
conceptoftheFourthParadigmforscience(Hey,etal.2009).9WhileGray’sinitialfocuswason
exploratoryanalysisanddata-intensivescientificcomputation,theFourthParadigmessentially
representstheradicaltransformationofthescientificmethodfromhypothesis-drivento
hypothesis-generatingscience,asdescribedherein.10Gray’sparadigmwasdevelopedinthe
late1990sandearly2000s,buthisgeniusliesinhisvisionfortoday’sworld,justadecadeor
8
“Farbetteranapproximateanswertotherightquestion,whichisoftenvague,thananexactanswertothe
wrongquestion,whichcanalwaysbemadeprecise.”—JohnTukey,1962,Thefutureofdataanalysis.Annalsof
MathematicalStatistics,33,1–67(quotedonpp.13-14).
9
AfterGray’sdeath,AlexSzalayandcolleaguesformallyintroducedtheconceptofthe“FourthParadigm”,withhis
co-authoredpublicationinthejournalScience(Bell,etal.2009).
10
DuringGray’slastlectureonJanuary11,2007[http://research.microsoft.com/enus/um/people/gray/jimgraytalks.htm],heisquotedasstatingthefollowing:“Theworldofsciencehaschanged,
andthereisnoquestionaboutthis.Thenewmodelisforthedatatobecapturedbyinstrumentsorgeneratedby
simulationsbeforebeingprocessedbysoftwareandfortheresultinginformationorknowledgetobestoredin
computers.Scientistsonlygettolookattheirdatafairlylateinthispipeline.Thetechniquesandtechnologiesfor
suchdata-intensivesciencearesodifferentthatitisworthdistinguishingdata-intensivesciencefrom
computationalscienceasanew,fourthparadigmforscientificexploration.”(Hey,etal.,2009).
RENCIWhitePaperSeries,Vol.3,No.6 13
twolater,wherenearlyeveryscientificdomainismovingtowarddata-intensiveanddataenabledscientificdiscovery.Thisshiftinvolvestraditionalfieldssuchasphysicsandastronomy,
whichhavealwayshadaccesstorichdatasourcesbutarenowfacingunprecedented
computationalandanalyticalchallenges,andemergentfieldssuchasenvironmentalscience,
whichhasonlyrecentlyhadaccesstothewealthofdistributeddataavailablefrom
environmentalsensorsandsatellites.Eventhe“soft”sciencessuchassocialscienceand
politicalsciencearerecognizingthepowerofdataderivedfromsocialmediadata,mobile
devices,and“smartcities”.Moreover,appliedfieldssuchasmedicineareembracingtheFourth
Paradigmandthesemyriadnewdatasources.Forexample,in2011,theNationalResearch
Councilorganizedanadhoccommitteetodevelopaframeworkforaunifiedtaxonomyof
humandiseasethat,whenimplemented,willaccelerateprogresstowardprecisionmedicine,
whichisanewareaofmedicinethataimstopersonalizemedicinethroughtheintegrationof
dataonindividualvariabilityingenes,environment,andlifestyleinordertoidentifythemost
appropriatemedicalmonitoringandtreatmentplanforanygivenpatient(NationalResearch
Council2011).Thecommittee’sworklargelymotivatedPresidentObama’sJanuary2015
announcementduringtheStateoftheUnionAddressonanewNationalInitiativeinPrecision
Medicine(Collins&Varmus2015).
AFrameworkforScientificDiscoveryintheEraofBigData:TheCollaborative
KnowledgeNetwork
Thetimeisrighttowhole-heartedlyembracetheflexibilityandpowerthatnewapproachesto
scientificdiscoveryoffer,whilemaintainingthepowerthatthetraditionalscientificmethod
holds.
Thescientificapproachespresentedinthiswhitepaperareincreasinglybeingadoptedby
scientists;however,thereremainshesitancyabouttheirrolesandlegitimacyinthescientific
process,alackoftraininginandincentivesfortheiruse,andtheneedforadditionalresearchto
tailorandimprovetheseapproachesforscientificdiscovery.Wearguethataprimarychallenge
inmovingtowardsacombinationofhypothesis-anddata-drivenscienceistheintegrationof
approachesatacommunitylevel.
TheNationalResearchCouncil’sframeworktocreateapathtowardpersonalizedmedicine
includestheconceptofaKnowledgeNetworkofDiseaseasafederateddiscoverynetwork
organizedaroundanInformationCommonsofstructuredpatient-centricdatabasedlargelyon
genomicsandincludingpatientmedicalhistory,currentsignsandsymptoms,etc.(National
ResearchCouncil2011).Wesuggestasimilar,butbroaderframeworkforaCollaborative
KnowledgeNetworkforScientificDiscovery(Figure1).Theenvisionednetworkwouldbuild
upontheknowledgebasesthatarealreadybeingdeveloped,butthroughanopen,communitybasedeffortdesignedtolinktheseknowledgebasesandtherebycreateaknowledgenetwork
forscientificdiscovery.Theeffortwouldinvolvenotonlyscientificexperts,butallstakeholders,
includingpolicymakers,industryrepresentatives,andevenordinarycitizens,therebyenabling
nontraditionaldatasourcesanddatatypestodrivescientificdiscovery.Indeed,scientiststoday
haveaccesstoanabundanceofnewdatasourcesanddatatypes,whichwe’veclassifiedas:(1)
RENCIWhitePaperSeries,Vol.3,No.6 14
archetypalorvisiblebigdatasuchasthelarge,well-curated,publicdatasetsavailablefrom
NASAandotherlarge-scaleresearchinitiatives;(2)crowd-sourcedorsupernovabigdatasuch
asthemoment-to-momentdatasetsavailablefromTwitterandtheInternetofThings;and(3)
Figure1.ConceptualframeworkfortheproposedCollaborativeKnowledgeNetworkfor
ScientificDiscovery.
long-tailordarkbigdatasuchasthosehard-to-finddatasetsheldbysmallscientificteamsand
amateurorcitizenscientists.EffortssuchasWikipediaprovideamodelforopen,collaborative,
community-governedknowledgeaccumulation(Cohen2014).Acommunity-drivenand
community-governedknowledgenetworkcouldrevolutionizescientificdiscoveryandmove
scienceindirectionsnotimaginabletoday.
ClosingConsiderations
TheCollaborativeKnowledgeNetworkforScientificDiscovery,asproposedandenvisioned
herein,willrequirenewscientificmethods,approaches,andpoliciesinordertofullyrealizeits
potential.Forexample,issuesrelatedtodataprovenancewillneedtobeaddressed,aswill
issuesrelatedtofundingforongoingdevelopmentandlong-termsustainability.Theintegration
ofknowledgeassertionsgeneratedthroughdifferentscientificapproachesthathavediffering
levelsofcertaintyandexpressivenessremainsafundamentalchallenge,butonewhere
progressisbeingmade.Thesearenottrivialissues,asthecreationofnewknowledgefrom
collectiveknowledge,particularlywhenautomated,yieldscomplicatedissuesregarding
intellectualpropertyandownership.Withmultiplestakeholdersinvolvedintheenvisioned
RENCIWhitePaperSeries,Vol.3,No.6 15
knowledgenetwork(e.g.,academics,industry,government,thepublic),ownershipissueswill
needtoberesolved.Ongoingdevelopmentandlong-termsustainabilitybringrelated
challengesthatlikelywillrequiresignificantup-frontcommunitybuy-in.Complexmodelsfor
provenanceandsustainabilitymayneedtobedevelopedtoaccountforconflictinginterests.
Furthermore,today’sstudentsandworkersaren’tbeingtrainedfortheFourthParadigmof
scienceandoftenlackthecriticalthinkingskillsrequiredtoobjectivelynavigatethroughthe
abundanceofdataavailabletodayandmakemeaningfulinferencesabouttheworldaround
them.Thisgapintrainingandworkforcedevelopmentwillneedtobeaddressedinorderto
ensurethataknowledgenetworkisnotmisused.
Perhapsthemostimportantconsideration,however,isthedevelopmentofnewapproachesto
enablethecriticalevaluationandadjudicationofscientificfindingsinthemosttraditional
sense.For,intheabsenceofvalidautomatedmethodsfordrawingconclusionsregarding
scientific“truths,”scientistswillcontinuetofaceadatadelugewithoutarationalpathtoward
peerconsensus,erroneousknowledgemaybeintroducedandpropagated,andthelaypublic
willlosetrustinscience.Indeed,thepublicalreadyhasatendencytomistrustscience,
especiallywhenscientificfindingsgoagainstintuitionorpersonalbelief(Achenbach2015).
Withoutafacilemethodtodeterminethelegitimacyofscientificfindings,anypublicmistrustin
sciencewillonlyincrease.Intheabsenceofpublictrust,scienceitselfwillsuffer.
Howtoreferencethispaper:
Schmitt,C.,Cox,S.,Fecho,K.,Idaszak,R.,Lander,H.,Rajasekar,A.andThakur,S.(2015):
ScientificDiscoveryintheEraofBigData:MorethantheScientificMethod.RENCI,Universityof
NorthCarolinaatChapelHill.Text.http://dx.doi.org/10.7921/G0C82763
RENCIWhitePaperSeries,Vol.3,No.6 16
References*
Achenbach,J.(2015).Whydomanyreasonablepeopledoubtscience?NationalGeographic.
http://ngm.nationalgeographic.com/2015/03/science-doubters/achenbach-text.
Achilleos,K.G.,Kannas,C.C.,Nicolaou,C.A.,Pattichis,C.S.,&Promponas,V.J.(2012).Open
sourceworkflowsystemsinlifesciencesinformatics.Proceedingsofthe2012IEEE12th
InternationalConferenceonBioinformatics&Bioengineering(BIBE).Larnaca,Cyprus.
Alemida,P.(2013).Theoriginsandpurposeofscientificpublications.JEPSBulletin.
http://blog.efpsa.org/2013/04/30/the-origins-of-scientific-publishing/.
AmericanChemicalSocietyPublications.(2013).Befoundorperish:writingscientific
manuscriptsforthedigitalage.BIOJOURNALS.AmericanChemicalSociety.
http://pubs.acs.org/bio/ACS-Guide-Writing-Manuscripts-for-the-Digital-Age.pdf.
Behrens,J.T.(1997).Principlesandproceduresofexploratorydataanalysis.PsycholMethods.,
2(2),131–160.
Bell,G.,Hey,T,&Szalay,A.(2009).BeyondtheDataDeluge,Science,323(5919),1297-1298.
Berman,H.M.,Gabanyi,M.J.,Groom,C.R.,Johnson,J.E.,Murshudov,G.N.,Nicholls,R.A.,
Reddy,V.,Schwede,T.,Zimmerman,M.D.,Westbrook,J.,Minor,W.(2015).Datato
knowledge:howtogetmeaningfromyourresult.IUCrJ.,2(Part1),45–58.
Bertolucci,J.(2013).Bigdata’snewbuzzword:datafication.InformationWeek.
http://www.informationweek.com/big-data/big-data-analytics/big-datas-new-buzzworddatafication/d/d-id/1108797?.
Betz,F.(2011),OriginofScientificMethod.InManagingScience:Methodologyand
OrganizationofResearch(Innovation,Technology,andKnowledgeManagementSeries,No.9),
byF.Betz.SpringerScience+BusinessMedia.
Bittrich,M.E.(2000).BiographyofJohnWilderTukey.AT&TBellLaboratoriesCommunication.
http://cm.belllabs.com/cm/stat/tukey/bio.html.
Bordes,A.,Weston,J.,Collobert,R.,Bengio,Y.(2009).Learningstructuredembeddingsof
knowledgebases.AssociationfortheAdvancementofArtificialIntelligence,301-306.
http://www.thespermwhale.com/jaseweston/papers/AAAI11.pdf.
Bowers,S.(2012).Scientificworkflow,provenance,anddatamodelingchallengesand
approaches.J.DataSemant.,1(1),19–30.
RENCIWhitePaperSeries,Vol.3,No.6 17
Brillinger,D.R.(2002).JohnW.Tukey:hislifeandprofessionalcontributions.AnnalsStatist.,30
(6),1535–1575.
Burkhardt,F.(1996).CharlesDarwin’sLetters.ASelection.CambridgeUniversityPress.
Buytaert,W.,Baez,S.,Bustamante,M.,Dewulf,A.(2012).Web-basedenvironmental
simulation:bridgingthegapbetweenscientificmodelinganddecision-making.EnvironScience
Technol.,46,1971–1976.
CohenN.Wikipediavs.theSmallScreen.TheNewYorkTimes.February9,2014.
http://www.nytimes.com/2014/02/10/technology/wikipedia-vs-the-small-screen.html?_r=2m.
ColdSpringHarborLaboratory.(2014).Guidetoopenaccess.
http://cshl.libguides.com/content.php?pid=222607&sid=1847688.
Collins,F.S.,Varmus,H.(2015).Anewinitiativeonprecisionmedicine.NEJM,372(9),793–795.
http://www.nejm.org/doi/pdf/10.1056/NEJMp1500523.
Curcin,V.,Ghanem,M.(2008)Scientificworkflowsystems–canonesizefitall?Proceedingsof
the2008IEEE,CIBEC.IEEE.
Diaconis,P.(2011).TheoriesofDataAnalysis:frommagicalthinkingthroughclassicalstatistics.
In:ExploringDataTables,Trends,andShapes,byD.C.Hoaglin,F.Mosteller,andJ.W.Tukey.
JohnWiley&Sons,2011.
Diemer,J.,Alpers,G.W.,Peperkorn,H.M.,Shiban,Y.,Mühlberger,A.(2015).Perceptionand
presenceonemotionalreactions:areviewofresearchinvirtualreality.FrontPsychol.,6,Article
26.
Evans,J.,Wilhelmsen,K.,Berg,J.,Schmitt,C.,Krishnamurthy,A.,Fecho,K.,&Ahalt,S.(2015).
Clinicalgenomics:howmuchdataisenough?.RENCI,UniversityofNorthCarolinaatChapelHill.
Text.doi:10.7921/G0F769G9.http://renci.org/wp-content/uploads/2015/02/0215WhitePaper-CLinicalGenomics-highres.pdf.
Fayyad,U.,Piatetsky-Shapiro,G.,Smyth,P.(1996).Fromdataminingtoknowledgediscoveryin
databases.AIMagazine,Fall1996,37–54.
Gelman,A.(2004).Exploratorydataanalysisforcomplexmodels.J.ComputationalGraphic
Statist.,13(4),755–779.
Gower,B.(1997).Scientificmethod:anhistoricalandphilosophicalintroduction.London,
England:Routledge.
Guo,P.(2013).Datascienceworkflow:overviewandchallenges.CommunicationsoftheACM
blog.http://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-andchallenges/fulltext.
RENCIWhitePaperSeries,Vol.3,No.6 18
HeyT,TansleyS,TolleK(Eds.)TheFourthParadigm.Data-IntensiveScientificDiscovery.Seattle,
Washington:MicrosoftCorporation;2009.
Holzinger,A.,Dehmer,M.,Jurisica,I.(2014).Knowledgediscoveryandinteractivedatamining
inbioinformatics–state-of-the-art,futurechallengesandresearchdirections.BMC
Bioinformatics,15(Suppl6),1.
Jarke,M.,Neumann,B.,Vassiliou,Y.,Wahlster,W.(1978).KBMSrequirementsofknowledgebasedsystems.In:Schmidt,J.W.,&Thanos,C.(eds.)FoundationsofKnowledgeBase
Management.ContributionsfromLogic,Databases,andArtificialIntelligence.Berlin,Germany:
Springer,pp.391-395.
http://www.dfki.de/wwdata/Publications/KBMS_Requirements_of_KnowledgeBased_Systems.pdf.
Joppa,L.N.,McInerny,G.,Harper,R.,Salido,L.,Takeda,K.,O’Hara,K.,Gavaghan,D.,Emmo,S.
(2013).Troublingtrendsinscientificsoftwareuse.Science,340(6134),814–815.
Kunkler,K.(2006).Theroleofmedicalsimulation:anoverview.TheInternationalJournalof
MedicalRoboticsandComputerAssistedSurgery,2,203-210.
http://onlinelibrary.wiley.com/doi/10.1002/rcs.101/pdf.
Leonhardt,D.(2000).JohnTukey,85,statistician;coinedtheword‘software’.TheNewYork
Times.http://www.nytimes.com/2000/07/28/us/john-tukey-85-statistician-coined-the-wordsoftware.html.
Manyika,J.,Chu,M.,Bisson,P.,Woetzel,J.,Dobbs,R.,Bughin,J.,Aharon,D.(2015).The
InternetofThings:mappingthevaluebeyondthehype.McKinsey&Company.
http://www.mckinsey.com/~/media/McKinsey/dotcom/Insights/Business%20Technology/Unlo
cking%20the%20potential%20of%20the%20Internet%20of%20Things/Unlocking_the_potential
_of_the_Internet_of_Things_Executive_summary.ashx.
McKie,T.HowDarwinwontheevolutionrace.TheGuardian.June21,2008.
http://www.theguardian.com/science/2008/jun/22/darwinbicentenary.evolution.
MiH,MuruganujanA,CasagrandeJT,ThomasPD.(2013).Large-scalegenefunctionanalysis
withthePANTHERclassificationsystem.NatProtoc.,8(8),1551-1566
Montgomery,S.(2009).CharlesDarwin&evolution,1809~2009.Naturalselection.
Cambridge,UK:Christ’sCollege,UniversityofCambridge.
http://darwin200.christs.cam.ac.uk/pages/index.php?page_id=d3.
NationalResearchCouncil.(2011).TowardPrecisionMedicine.BuildingaKnowledgeNetwork
forBiomedicalResearchandaNewTaxonomyofDisease.Washington,D.C.:TheNational
AademiesPress.ContractNo.N01-0D-4-2139.http://www.nap.edu/catalog/13284/towardprecision-medicine-building-a-knowledge-network-for-biomedical-research.
RENCIWhitePaperSeries,Vol.3,No.6 19
Nickel,M.,Murphy,K.,Tresp,V.,Gabrilovich,E.(2015).Areviewofrelationalmachinelearning
forknowledgegraphs.Frommulti-relationallinkpredictiontoautomatedknowledgegraph
construction.CornellUniversityLibrary:arXiv:1503.00759.http://arxiv.org/abs/1503.00759.
Perra,N.,Goçalves,B.,Pastor-Satorras,R.,Vespignani,A.(2012).Activitydrivenmodelingof
timevaryingnetworks.ScientificReports,2,469,1.
Perraud,J.M.,Bai,Q.,Hebir,D.(2010).Ontheappropriategranularityofactivitiesinascientific
workflowappliedtoanoptimizationproblem.InternationalEnvironmentalModellingand
SoftwareSociety(iEMSs)2010InternationalCongressonEnvironmentalModellingandSoftware
ModellingforEnvironment’sSake,FifthBiennialMeeting.Ottawa,Canada.
Pugh,K.,Prusak,L.Designingeffectiveknowledgenetworks.MITSloanManagementReview.
September12.2013.http://sloanreview.mit.edu/article/designing-effective-knowledgenetworks/.
Reshef,D.N,Reshef,Y.A.,Finucane,H.K.,Grossman,S.R.,McVean,G.,Turnbaugh,P.J.,
Lander,E.S.,Mitzenmacher,M.,Sabeti,P.C.(2011).Detectingnovelassociationsinlarge
datasets.Science,334(6062),1518-1524.
Samuelson,P.A.(1954).Thepuretheoryofpublicexpenditure.TheReviewofEconomicsand
Statistics,36(4),387–389.
http://www.ses.unam.mx/docencia/2007II/Lecturas/Mod3_Samuelson.pdf.
Singh,M.P.,Vouk,M.A.(undated).Scientificworkflows:scientificcomputingmeets
transactionalworkflows.
http://www.csc.ncsu.edu/faculty/mpsingh/papers/databases/workflows/sciworkflows.html.
Smith,D.F.(2014).Abriefhistoryofgamification.EDTECH,FocusonHigherEducation.
http://www.edtechmagazine.com/higher/article/2014/07/brief-history-gamificationinfographic.
Smith,R.(2006).Peerreview:aflawedprocessattheheartofscienceandjournals.JRoyal
SocietyMed.,99(4),178–182.
Southern,M.Googleislookingintowaystoranksitesbasedonaccuracyofinformation.Search
EngineJournal.March4,2015.http://www.searchenginejournal.com/google-is-looking-intoways-to-rank-sites-based-on-accuracy-of-information/127394/#.
Spangler,B.(2003).Adjudication.BeyondIntractability.
http://www.beyondintractability.org/essay/adjudication.
Speed,T.(2011).Mathematics.Acorrelationforthe21stcentury.Science,334(6062),15021503.
RENCIWhitePaperSeries,Vol.3,No.6 20
Trader,T.(2012).Tamingthelongtailofscience.HPCWire.
http://www.hpcwire.com/2012/10/15/taming_the_long_tail_of_science/.
Wilhelmsen,K.,Schmitt,C.&Fecho,K.(2013).Factorsinfluencingdataarchivaloflarge-scale
genomicdatasets:amathematicalformalismtocomprehensivelyevaluatethecosts-benefits
ofarchivinglargedatasets.RENCITechnicalReportSeries,TR-13-03.UniversityofNorth
CarolinaatChapelHill,ChapelHill,NC,USA.doi:10.7921/G0MW2F25.
http://renci.org/technical-reports/factors-influencing-data-archival-of-large-scale-genomicdata-sets/.
Zyda,M.(2005).Fromvisualsimulationtovirtualrealitytogames.Computer,38(9),25–32.
*AllhyperlinkswerelastaccessedonSeptember4,2015.
RENCIWhitePaperSeries,Vol.3,No.6 21
Download