Scientific Discovery in the Era of Big Data: More than the Scientific Method

A RENCI WHITE PAPER
RENCI White Paper Series, Vol. 3, No. 6, November 2015

Authors
Charles P. Schmitt, Director of Informatics and Chief Technical Officer
Steven Cox, Cyberinfrastructure Engagement Lead
Karamarie Fecho, Medical and Scientific Writer
Ray Idaszak, Director of Collaborative Environments
Howard Lander, Senior Research Software Developer
Arcot Rajasekar, Chief Domain Scientist for Data Grid Technologies
Sidharth Thakur, Senior Research Data Software Developer

Renaissance Computing Institute
University of North Carolina at Chapel Hill
Chapel Hill, NC, USA
919-445-9640

AT A GLANCE

• Scientific discovery has long been guided by the scientific method, which is considered to be the “gold standard” in science.
• The era of “big data” is increasingly driving the adoption of approaches to scientific discovery that either do not conform to or radically differ from the scientific method. Examples include the exploratory analysis of unstructured data sets, data mining, computer modeling, interactive simulation and virtual reality, scientific workflows, and widespread digital dissemination and adjudication of findings through means that are not restricted to traditional scientific publication and presentation.
• While the scientific method remains an important approach to knowledge discovery in science, a holistic approach that encompasses new data-driven approaches is needed, and this will necessitate greater attention to the development of methods and infrastructure to integrate approaches.
• New approaches to knowledge discovery will bring new challenges, however, including the risk of data deluge, loss of historical information, propagation of “false” knowledge, reliance on automation and analysis over inquiry and inference, and outdated scientific training models.
• Nonetheless, the time is right for increased focus on the construction of Collaborative Knowledge Networks for Scientific Discovery designed to leverage existing data sources and integrate traditional and emerging scientific methods and thereby drive scientific discovery and application.

Introduction

Knowledge discovery in science refers to the systematic process whereby scientists draw logical conclusions regarding the world around us, generate new theories based on those conclusions, and share findings with other scientists and the lay public, thus enabling critical review and consensus before new findings are added to the collective body of knowledge. Historically, scientific discovery has been guided by the scientific method, which dates to ancient times and involves both a philosophical and a practical approach to science. Indeed, the renowned philosopher Aristotle (384-322 BC) was one of the first to approach knowledge discovery through rigorous, systematic observation, although it wasn’t until millennia later that the scientific method was actually formalized and implemented, largely through the work of Copernicus (1473-1543), Tycho Brahe (1546-1601), Johannes Kepler (1571-1630), Galileo Galilei (1564-1642), René Descartes (1596-1650), and Isaac Newton (1643-1727) (Gower 1997; Betz 2011).
At first glance, the scientific method appears to be relatively simple and straightforward. In sum, the method involves a repeating cycle of standardized steps: the process begins with careful observation of the natural world, the framing of a question on the basis of one’s observations, and a review of the existing body of knowledge to determine if a reasonable explanation already exists. Assuming that the question remains open ended, a scientist will then formulate a hypothesis, design and implement an experiment to test the hypothesis, and analyze the experimental results. The analytical results are used to formally accept or reject the hypothesis. The hypothesis may then be modified, with the experiment repeated or a new experiment designed. Importantly, the results are then disseminated via presentation to peers and publication, which allow for adjudication [1] or peer consensus regarding the validity of the scientific findings. This, in short, is scientific discovery via the scientific method.

Revisiting Charles Darwin

As any schoolchild will attest, the work of Charles Darwin provides an exemplar of the scientific method and the many challenges involved in scientific discovery (Burkhardt 1996; McKie 2008; Montgomery 2009). Darwin’s genius perhaps lies in his keen ability to observe the natural world. One of his earliest observations was that similar species are found across the globe and that individuals within a species are not identical but have local variation. He questioned why multiple species exist instead of just one. Darwin continuously researched and reviewed the existing scientific literature (i.e., the established scientific body of knowledge).
Darwin’s work was influenced by the research and writings of Thomas Malthus, who found that humans produce more offspring than are needed to replace themselves and speculated that population size would soon exceed the available resources required for survival. Darwin also observed that populations of plants and animals stay about the same size because of limited resources and competition for those resources. Darwin’s thinking also was influenced by the research and writings of Charles Lyell, who found that small, gradual geological processes can produce large changes over time. Darwin made several brilliant inferences on the basis of his scientific observations and the work of Malthus, Lyell, and others that led him to hypothesize that species change slowly in a process of evolution from a common ancestor. He then spent decades in observational experimentation, ongoing analysis of his scientific findings, and refinement of his hypothesis until he eventually reached the now-famous conclusion that the origin of the species lies in natural selection, or the process whereby individual variation, coupled with competition among individuals for natural resources and social cooperation among kin to increase individual fitness, determines differences among related species.

Coincidentally, while Darwin was refining his hypothesis and drawing a conclusion, Alfred R. Wallace, a much more junior researcher who was familiar with Darwin’s work, reached a similar conclusion, which he planned to publish and disseminate to scientific peers. Aware that his work might go unrecognized, [2] Darwin reached an agreement with Wallace, brokered by Lyell and Joseph D. Hooker, and reluctantly published a joint scientific manuscript in 1858.
On the Origin of Species by Means of Natural Selection wasn’t published until November 1859, and then only as an abstract of the much larger work that Darwin had intended, not as the full book. Interestingly, the scientific community reacted negatively to Darwin’s work (e.g., Gray 1860); many of Darwin’s peers felt that he did not have sufficient evidence to put forth his hypothesis, while others were disturbed by the theological, political, and social implications. The scientific debate continued for nearly a century until the scientific community reached adjudication on Darwin’s hypothesis and scientific findings and established the theory of evolution by natural selection. Of note, the lay debate continues today, largely on theological and political grounds.

[1] “Adjudication” is a legal term that refers to decision making in the presence of a neutral third party who has the authority to determine a binding resolution through some form of judgment or award. In science, adjudication generally refers to decision making and consensus building among a group of widely regarded scientific experts on the topic under discussion (Spangler 2003).

[2] “I never saw a more striking coincidence, if Wallace had my M.S. sketch written out in 1842 he could not have made a better short abstract! Even his terms now stand as Heads of my Chapters… So all my originality, whatever it may amount to, will be smashed. Though my Book, if it will ever have any value, will not be deteriorated; as all the labour consists in the application of the theory.” Charles Darwin to Charles Lyell, June 18, 1858 (in Charles Darwin’s Letters: A Selection, F. Burkhardt (Ed.), 1996).
The scientific method has certain desirable characteristics that have enabled it to withstand the passage of time, even as new scientific tools and techniques have been introduced to the scientific process. For example, the method is objective and removed from all personal and cultural biases. All hypotheses are developed to be consistent with accepted scientific truths at the time that the hypothesis is generated. All measurements must be observable and pertinent to the hypothesis. The hypothesis and all conclusions must follow the principle of Occam’s Razor in terms of parsimony, or simplicity with few assumptions. The hypothesis also must be falsifiable (i.e., capable of being disproven). Finally, the scientific findings must be reproducible by other scientists.

Without dispute, the scientific method remains the most common and only validated approach to scientific discovery and knowledge extraction. In fact, drug development in the U.S. relies exclusively on the randomized placebo-controlled clinical trial, the exemplar of the scientific method.

However, today’s advancements in digital computing and storage capabilities, coupled with new methods for scientific communication, including social media, are introducing new approaches to scientific discovery, each of which brings challenges and opportunities. In this white paper, we consider how the advent of the digital age and today’s world of “big data” are changing scientific discovery processes. We close with a framework for Collaborative Knowledge Networks for Scientific Discovery designed to leverage existing data sources and integrate traditional and new data-driven scientific methods, allowing for unprecedented advances in scientific discovery and application.
Exploratory data sets and exploratory analysis

Today’s scientist has access to numerous large data sets of relevance to multiple scientific domains. For example, the National Oceanic and Atmospheric Administration maintains several databases containing data on climate patterns, earthquakes, ozone levels, and ocean temperatures; these data are useful to scientists in many fields, including environmental science, energy, public health, and medicine. Scientists are increasingly accessing these data sets for exploratory analysis, which is an approach used to determine if a general hypothesis bears any merit and/or if an experimental design is feasible. Exploratory analysis typically begins with a general hypothesis or experimental design that isn’t well fleshed out and proceeds through the application of a variety of statistical approaches and visualization techniques to identify and validate data elements and/or determine if a hypothesis is testable using the data (Behrens 1997; Gelman 2004; Diaconis 2011).

For example, consider an economics researcher who is interested in learning about the lending practices of a large credit union, in terms of the breakdown of mortgage loans across socioeconomic sectors. S/he requests access to the credit union’s loan database in order to determine how the data are structured and how fine-grained the available socioeconomic data elements are. The researcher determines that the database contains data elements on the age, sex, and gross income of loan holders, as well as the loan amount and the location of the property, but it does not contain information on the ethnicity of the loan holders. The researcher uses this information to modify his/her hypothesis for subsequent testing using the same data.
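The first pass over the credit union data described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names (`gross_income`, `property_zip`, and so on), and the helper functions `profile` and `missing_fields` are all hypothetical and do not come from any real loan database.

```python
from statistics import median

# Hypothetical loan records; every field name and value is illustrative only.
loans = [
    {"age": 34, "sex": "F", "gross_income": 58000, "loan_amount": 210000, "property_zip": "27514"},
    {"age": 51, "sex": "M", "gross_income": 92000, "loan_amount": 340000, "property_zip": "27701"},
    {"age": 29, "sex": "F", "gross_income": None, "loan_amount": 150000, "property_zip": "27514"},
    {"age": 45, "sex": "M", "gross_income": 71000, "loan_amount": 265000, "property_zip": "27516"},
]


def profile(records):
    """Summarize each data element: how often it is present, plus basic
    statistics when every recorded value is numeric."""
    fields = sorted({key for record in records for key in record})
    report = {}
    for field in fields:
        values = [r[field] for r in records if r.get(field) is not None]
        entry = {"present": len(values), "missing": len(records) - len(values)}
        numeric = [v for v in values if isinstance(v, (int, float))]
        if numeric and len(numeric) == len(values):
            entry.update(min=min(numeric), median=median(numeric), max=max(numeric))
        report[field] = entry
    return report


def missing_fields(records, required):
    """Return the data elements a hypothesis needs that the data set lacks."""
    available = {key for record in records for key in record}
    return sorted(set(required) - available)


report = profile(loans)
gaps = missing_fields(loans, ["gross_income", "loan_amount", "ethnicity"])
```

Here `gaps` lists `ethnicity` as unavailable, which mirrors the point at which the researcher in the example revises the hypothesis to fit the data elements that actually exist.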
Exploratory analysis has always been part of scientific discovery and, historically, has been used to generate the driving question underlying a scientific hypothesis. However, in the past, the process was informal, slow, and observation-based (e.g., detailed notes on the types of foliage identified at different altitudes within a given region); whereas today, it is fast, large-scale, and data-driven and often involves extensive use of advanced statistical methods and visualization techniques. Furthermore, large data sets are being generated strictly for exploratory analysis, and often such data sets incur a considerable expense with questionable cost-benefit. [3] A recent McKinsey report (Manyika, et al. 2015), for example, estimates that less than 1% of data captured and stored from an offshore oil rig equipped with 30,000 sensors is actually used to guide operations. The value of the rest of the data remains to be determined.

In terms of benefits, exploratory analysis enables a scientist to quickly develop a more refined, testable hypothesis and rigorous experimental design for subsequent hypothesis testing (Behrens 1997; Gelman 2004; Diaconis 2011). Challenges include the costs involved with the generation and long-term storage of exploratory data sets. In addition, drawing conclusions from an analysis of exploratory data sets can introduce error if the data elements are not described with sufficient metadata to fully understand data structure and meaning or if the data elements have inherent biases due to how they were collected. Additional challenges include issues of privacy and the inadvertent or intentional leakage of “sensitive” data, particularly if a scientist does not obtain the requisite authorization to access a data set that contains sensitive data or if sensitive data are accidentally provided to an unauthorized or third-party user (Behrens 1997).
[3] The costs and benefits of these exploratory data sets are an open topic of discussion; for example, see Wilhelmsen, et al. 2013 and Evans, et al. 2015 for discussions on the pros and cons of capturing and storing whole genome versus targeted genome sequencing data.

Data mining

Data mining often begins without a hypothesis and involves the application of tools and techniques from statistics, mathematics, and visualization to identify previously unknown patterns and trends in a data set derived from one or more existing databases (Fayyad, et al. 1996; Holzinger, et al. 2014). While data mining is sometimes considered a form of exploratory data analysis (Behrens 1997; Diaconis 2011), we argue for a distinction between exploratory data analysis, which enables a scientist to explore a data set in terms of structure and types of data elements, including the relevancy of external data sources (e.g., literature citations) to data elements, and data mining, which enables a scientist to identify previously unknown patterns in a data set in terms of relationships between data elements. We note that new statistical approaches are being developed to overcome some of the limitations of both types of scientific discovery. For example, the maximal information coefficient, or MIC, overcomes the limitations of Pearson’s correlation coefficient, r, by allowing for the identification of complex, nonlinear relationships (e.g., exponential, periodic) between data elements in data sets of any size (Reshef, et al. 2011). [4] Suppose a microbiologist generates a gene expression data set to identify genes involved in the regulation of the cell cycle. Using traditional statistical approaches, the scientist is able to identify genes with strong deterministic or linear associations with different aspects of the cell cycle (e.g., a gene that is required for chromatin assembly). Using approaches such as the MIC, the scientist is able to identify genes with periodic relationships with the cell cycle (e.g., a gene associated with an established cyclical event or an event that occurs at a previously unknown frequency during the cell cycle).
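The limitation of Pearson’s r that motivates measures such as the MIC can be illustrated with a small, self-contained Python sketch. Note the assumptions: the data are synthetic (y = x squared on a symmetric grid), and `binned_mutual_information` is a simple fixed-grid mutual information estimate written for this illustration. It is not the MIC algorithm of Reshef, et al., which searches over many grid resolutions and normalizes the result.

```python
import math
from collections import Counter


def pearson_r(xs, ys):
    """Pearson's correlation coefficient, computed from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def binned_mutual_information(xs, ys, bins=8):
    """Mutual information in bits, estimated on a fixed equal-width grid.
    Illustrative only; this is a simplification and not the MIC."""
    def to_bins(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0
        return [min(int((v - lo) / width), bins - 1) for v in values]

    bx, by = to_bins(xs), to_bins(ys)
    n = len(xs)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    mi = 0.0
    for (i, j), count in joint.items():
        p = count / n
        mi += p * math.log2(p / ((px[i] / n) * (py[j] / n)))
    return mi


# Synthetic data: y depends entirely on x, but not linearly.
xs = [i / 100 for i in range(-100, 101)]
ys = [x * x for x in xs]

r = pearson_r(xs, ys)                   # essentially zero
mi = binned_mutual_information(xs, ys)  # clearly positive
```

Because the relationship is symmetric about zero, r is essentially zero even though y is fully determined by x, while the information-based estimate is well above zero. This is the kind of nonlinear association that a linear correlation coefficient misses.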
Data mining bears little resemblance to the scientific method. In particular, data mining is not: (1) constrained to be objective; (2) consistent with accepted scientific truths; (3) parsimonious; or (4) falsifiable. In addition, all measurements are observable only after the fact. Nonetheless, data mining offers benefits, including the ability to relatively quickly discover previously unknown relationships and thereby generate a new hypothesis that can then be tested using the scientific method (Fayyad, et al. 1996; Holzinger, et al. 2014). Although often cited as a criticism of data mining, a key benefit is the ability to generate multiple associations within a single data set, which may yield powerful new information, especially when combined with associations identified in other data sets. In this regard, data mining can be viewed as akin to meta-analysis of clinical trial data.

Challenges include the fact that data mining requires extensive training beyond the skills of a typical domain scientist; without such training, a scientist may identify patterns or associations that are not valid or reproducible (Fayyad, et al. 1996; Behrens 1997; Diaconis 2011; Holzinger, et al. 2014). Further, there is no commonly accepted approach or method for data mining, which makes the field somewhat more of an art than a science (Fayyad, et al. 1996; Diaconis 2011). There is also a tendency to treat all findings as conclusive when they may be chance findings (Fayyad, et al. 1996; Diaconis 2011). In addition, while the power of data mining increases when multiple heterogeneous data sets are integrated before the data are mined, so too does the complexity of the process (Fayyad, et al. 1996; Holzinger, et al. 2014). With high-dimensional data, multiple approaches must be used to analyze the data; for example, sophisticated visualization approaches may need to be combined with traditional statistical approaches to data mining (Fayyad, et al. 1996; Diaconis 2011; Holzinger, et al. 2014).

[4] As an aside, we note that the theoretical foundation of the MIC lies in the concept of mutual information (MI) in pairs of random variables, which was developed by Claude Shannon, the founder of information theory, more than 50 years ago (Speed 2011). The implementation of MI into practice as MIC cannot be achieved through manual or semi-manual computation and instead requires significant digital computational power, something not available until recently.

Computer modeling

Computer modeling involves conceptual, mathematical, computer-generated, or physical representations of real-world objects or phenomena (Bowers 2012; Buytaert, et al. 2012; Perra, et al. 2012; Berman, et al. 2015). It is used to test a hypothesis or observe and manipulate an object or phenomenon that is otherwise difficult (or unethical) to observe and manipulate.

As an example, consider a scientist who wishes to determine how the beta-amyloid protein influences the development of Alzheimer’s disease. S/he develops a software program using established principles of protein folding to visually explore in 3-D different scenarios whereby beta-amyloid may fold in the brain to influence cognitive function and lead to the development of Alzheimer’s disease.

Modeling represents a fundamental aspect of scientific discovery and has been used throughout the history of modern science (e.g., anatomical models). The use of computer modeling is relatively new, however, and as such, computer-driven modeling tends to be a highly specialized tool as opposed to a general tool. Benefits of computer modeling include the addition of a potentially powerful tool to the formerly manual process, particularly when models incorporate the vast streams of data that are available from technologies such as crystallography and other sophisticated sources of data (Perra, et al. 2012; Berman, et al. 2015). Computer models become even more powerful when they are generated using data derived from multidisciplinary science (Bowers, et al. 2012; Buytaert, et al. 2012).
Challenges, however, include the fact that computer-generated models are only as good as the underlying data and software programs used to create them (Joppa, et al. 2013; Berman, et al. 2015). In addition, the risk of introducing error increases significantly when models are created for poorly understood objects or phenomena, when the data are qualitative or otherwise described in a non-standardized format, or when models developed in one field are applied (without validation) to another field or even shared between different research groups within the same field (Bowers, et al. 2012; Buytaert, et al. 2012; Perra, et al. 2012; Joppa, et al. 2013; Berman, et al. 2015). Once introduced, errors tend to propagate and may even amplify (Buytaert, et al. 2012). Moreover, many models are not designed to work in a dynamic, flexible, user-driven, web environment (Buytaert, et al. 2012). Finally, modeling costs can be quite high, sometimes higher than the costs of the instruments used to generate the data that are used in the models (Berman, et al. 2015).

Interactive simulation and virtual reality

Interactive simulation and virtual reality are similar to modeling in that a computer is used to create realistic scenarios for testing a hypothesis or manipulating an object or phenomenon that is otherwise difficult (or unethical) to test or manipulate. However, the difference, as defined herein, is that with interactive simulation and virtual reality, human behavior, not the underlying software program, guides the outcome of the simulation or virtual experience (Zyda 2005; Kunkler 2006; Diemer, et al. 2015).

Suppose a medical device company has been developing a new anesthesia machine. In order to obtain approval from the U.S. Food and Drug Administration, the device has to be tested for safety in Phase I and II clinical trials. To achieve this, the company creates a “dummy” patient that is programmed to mimic physiological responses during particular types of surgeries in a simulated operating room. A Phase I clinical trial is then conducted in the simulated environment, using a team of actual surgeons, anesthesiologists, and nurses as study subjects.
Interactive simulation and virtual reality can be used for hypothesis testing as part of the scientific method, but they also can be used for exploratory analysis. The benefits include the ability to investigate human behavior in the context of technology or computer-assisted scenarios that resemble the real world (Zyda 2005; Kunkler 2006; Diemer, et al. 2015). Thus, the approach facilitates both team-based science and science on human teams. Several challenges exist, however. First, a team with expertise in both software development and human behavior must create the software programs that drive the simulations and virtual scenarios, with careful attention to every detail of the scenario (Diemer, et al. 2015). Also, human behavior in a simulated environment might not be identical to human behavior in the real world; at present, software programs cannot fully capture the true human experience, including behaviors such as emotional reactions (Kunkler 2006). Finally, technological challenges grow exponentially with the complexity of the scenario, and upfront costs are high (Zyda 2005).

Scientific workflows

Scientific workflows refer to the abstract steps required to complete a specific scientific task. Today, these are typically designed as specialized workflow management systems that are capable of orchestrating and automating many of the steps, including data flow and, in some cases, personnel (e.g., scheduling, access rights, alerts or reminders), required to complete a specific task and/or test one or more hypotheses (Curcin & Ghanem 2008; Perraud, et al. 2010; Achilleos, et al. 2012; Bowers 2012; Guo 2013). The concept of the automated scientific workflow appears to have originated with an undated publication by Singh and Vouk (see references). Scientific workflows are used to automate tedious, time-consuming, or highly complex steps in experimental testing and/or analysis.
For instance, imagine that a biomedical research company relies on flow cytometry, a technique that enables the visualization and quantification of the protein composition of live cells, as a critical part of its drug discovery efforts in diabetes. The company decides to automate much of the process and hires a software engineering team to design a scientific workflow system to automate and coordinate the technologies and research teams involved in each step of the process, including isolation of blood cells from patients with diabetes, incubation of cells with one or more conjugated antibodies, multiple washes, immunofluorescence visualization, and analysis to determine cellular subpopulations.

The design of most scientific workflow systems models the actual scientific method, except that the measurements are automated and not always observable by the scientist. Moreover, scientific workflows are increasingly being used for automated deduction and inference, which are steps that historically relied on the scientist. In the example above, for instance, the use of flow cytometry to identify cellular subpopulations on the basis of their fluorescent profile can be more of an art than a science, especially when multiple fluorescent conjugates are used; as such, automating the process can yield erroneous or inconsistent results. That said, scientific workflows enable a scientist to conduct experiments more quickly, efficiently, and with greater statistical power and reproducibility due to the large data volumes that a well-designed scientific workflow can handle and the ability to integrate heterogeneous data sources, often in real time (Perraud, et al. 2010; Achilleos, et al. 2012; Bowers 2012; Guo 2013).
While scientific workflows are potentially quite powerful, the scientist working within a workflow is often dependent on the technology and removed from critical decision-making steps (e.g., calculations, incubation times, etc.), which, in the absence of careful system design and implementation, threatens the validity and reproducibility of workflow findings. Furthermore, unless the workflow steps are clearly annotated and openly accessible, a peer reviewer will be unable to fully evaluate and reproduce the scientific findings (Bowers 2012). Moreover, workflow design can be quite complex, involving hundreds of individual analysis steps, large sets of heterogeneous data, and multiple existing workflows that often were not designed to work together; the complexity presents difficulties in user-friendliness, design, and the capture and recording of provenance [5] (Perraud, et al. 2010; Achilleos, et al. 2012; Bowers 2012; Guo 2013). There is also a practical limit to the degree of granularity in workflow tasks; highly granular activities are typically not feasible due to a negative impact on overall system performance (Perraud, et al. 2010). An additional concern is that scientific workflows can introduce systematic bias in that errors (e.g., an incorrect calculation) may be introduced and propagated indefinitely because automation makes it more difficult to detect and correct errors. Finally, workflows tend to be very specialized and often cannot be realistically adapted for different domains (Curcin & Ghanem 2008).

[5] “Provenance” refers to the capture of metadata that describes the origins of the data, each step of data transformation and analysis, and a record of versioning to identify how up-to-date the data are (Guo 2013).
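To make the idea of provenance capture more concrete, the following Python sketch runs a tiny two-step workflow and records, for each step, its name, its parameters, and short hashes of its input and output. Everything here is an assumption made for illustration: the step functions (`subtract_background`, `gate`), the numbers, and the record layout are hypothetical and do not describe any particular workflow management system.

```python
import hashlib
import json


def digest(obj):
    """Short, stable fingerprint of any JSON-serializable value."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def run_workflow(data, steps):
    """Apply each step in order and keep a provenance record per step."""
    provenance = []
    for name, func, params in steps:
        result = func(data, **params)
        provenance.append({
            "step": name,
            "params": params,
            "input_hash": digest(data),
            "output_hash": digest(result),
        })
        data = result
    return data, provenance


# Hypothetical analysis steps on fluorescence intensities (arbitrary units).
def subtract_background(values, background):
    return [v - background for v in values]


def gate(values, threshold):
    return [v for v in values if v >= threshold]


steps = [
    ("subtract_background", subtract_background, {"background": 5}),
    ("gate", gate, {"threshold": 100}),
]
raw = [12, 150, 99, 230, 105]
result, provenance = run_workflow(raw, steps)
```

Because each record stores the hash of its input and output, a reviewer can check that the output of one step is exactly the input of the next and can rerun a single step with the recorded parameters. This is one simple way to keep automated steps inspectable rather than hidden from the scientist.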
Dissemination and adjudication

Dissemination and adjudication of scientific discoveries are critical components of scientific discovery, as these are the processes through which new scientific “truths” are added to the collective scientific body of knowledge (Smith 2006; ACS Publications 2013; Almeida 2013; Cold Spring Harbor Laboratory 2014). Dissemination has historically been achieved through presentation to scientific peers and publication in scientific journals in order to obtain critical peer review and validation (or rejection) of scientific findings and inform the scientific and lay communities. Adjudication is the method by which a consensus is reached among scientific experts regarding the validity of the scientific findings (also see footnote 1).

As an example of the dissemination and adjudication processes, assume a scientist delivers a PowerPoint presentation on new technologies for hydrology at a scientific conference on water safety. S/he receives immediate feedback on the presentation from colleagues in attendance at the lecture. On the basis of that feedback, the scientist decides that an additional experiment is required before his/her work is ready for publication in a peer-reviewed journal. To provide another example, imagine a scientist intends to publish his/her new ceramics model in a scientific journal focused on materials science. The journal requires critical peer review. [6] The scientist submits the manuscript to the journal for potential publication, and it is reviewed by anonymous peers. On the basis of the peer review, the editor decides that before the manuscript can be published, the text needs to be revised to better frame the need for a new ceramics model in the context of existing, similar models.

Historically, dissemination and adjudication have been key components of the scientific method and the final screen before new scientific truths are added to the scientific body of knowledge.
The benefits are tremendous because the process enables scientific findings to be rigorously evaluated by expert scientists, thus ensuring the robustness and integrity of scientific findings (Smith 2006; ACS Publications 2013; Almeida 2013; Cold Spring Harbor Laboratory 2014). The challenges, however, are growing in the era of big data. Specifically, the digital world is changing the way that scientific findings are generated, presented, published, and catalogued. For example, traditional, paper-based, peer-reviewed scientific journals are competing with new, electronic, open access scientific journals [7] that may not require a slow, rigorous peer review, but only a quick, minimal editorial review before publication (Almeida 2013; Cold Spring Harbor Laboratory 2014). In addition, many scientists are now publishing non-peer-reviewed scientific findings in electronic form, via websites, digital white papers and technical reports, blogs, webcasts, YouTube videos, etc. (ACS Publications 2013; Almeida 2013). This approach allows for rapid dissemination of scientific findings, but it bypasses both the peer review process for publication and the peer selection process for speaker presentation at a scientific meeting. That said, the current peer review process is not without flaw, in terms of bias and error (Smith 2006), so perhaps the shift to digital self-publication will reform the peer review process or kindle an entirely new form of peer review. New hybrid models such as eGEMs combine peer review with free publication and open access permissions; whether these models are fiscally sustainable has yet to be seen. Another challenge is the wealth of scientific information available today. This dissemination deluge makes it challenging to adjudicate existing findings and identify important new findings that may become lost in the wealth of available data (i.e., the “long tail of science”) (Trader 2012).

[6] “Peer review” refers to the process whereby scientific journals or scientific funding organizations enlist (for free) the help of colleagues within the same field as the manuscript or grant proposal’s investigative team to evaluate the scientific merit and integrity of a document (Smith 2006).

[7] “Open access” scientific journals are online journals that permit access to all journal content without a subscription fee (Cold Spring Harbor Laboratory 2014). The premise is to allow ordinary citizens to access all scientific findings, particularly those that are generated with federal money, as a “public good” (e.g., http://publicaccess.nih.gov/), much the same way that federal salaries are released to the public for evaluation. “Public goods”, as defined by economists, are those items (commodities or services) that are both non-excludable and non-rivalrous; examples include public parks, sewer systems, highways, police services, etc. (Samuelson 1954). Ironically, many journals require the author to pay a fee for open access publication.
Knowledgebases

Traditionally, the dissemination and adjudication of scientific knowledge involve social processes; however, the storage, integration, search, and inference of knowledge are rapidly changing to a hybrid model that relies equally upon humans and large, complex collections of digital information that include data, relationships between data elements, human assertions on the data, and, in some instances, capabilities for automated cognitive processing: knowledgebases. The term “knowledgebase” was introduced in the 1970s (Jarke, et al. 1978) to distinguish it from the existing database of the time, which essentially stored data in tabular form for user access via query. The term remains loosely defined, but a knowledgebase can be considered to be a computing system in which assertions are represented and persisted within an overarching ontological framework that provides the semantic context and that allows for querying and computing across assertions. The human user is integral to the knowledge derived from and contained within the knowledgebase; and, in some cases, the system can be structured to derive new, automated assertions based on machine learning or new data. While traditional databases have evolved to become transaction-based and relational and can be part of a knowledgebase system, knowledgebases remain differentiated in that the human user’s ability to dynamically access, add to, and use the knowledge is an essential part of the design and function of the system (e.g., Wikipedia) (Pugh & Prusak 2013).

Completely automated knowledgebases are under development, but automated curation of human assertions is not yet possible and may never be. Nonetheless, emerging knowledgebases are incorporating sophisticated algorithms for automated and semi-automated structuring, parsing, summarization, retrieval, and visualization of information.
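As a rough illustration of what “querying and computing across assertions” can mean, the Python sketch below stores assertions as subject-predicate-object triples, answers pattern queries, and derives new assertions by following `is_a` links transitively. It is a toy under stated assumptions: the triple layout, the predicate names, and the functions `query` and `infer_is_a` are invented for this example and are not the design of any knowledgebase named in this paper.

```python
# Each assertion is a (subject, predicate, object) triple. The handful of
# entries below is illustrative, not a curated data set.
ASSERTIONS = {
    ("BRCA1", "is_a", "tumor_suppressor_gene"),
    ("TP53", "is_a", "tumor_suppressor_gene"),
    ("tumor_suppressor_gene", "is_a", "gene"),
    ("BRCA1", "participates_in", "DNA_repair"),
}


def query(kb, subject=None, predicate=None, obj=None):
    """Return all assertions matching the pattern; None acts as a wildcard."""
    return sorted(
        (s, p, o)
        for (s, p, o) in kb
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    )


def infer_is_a(kb):
    """Return kb plus every "is_a" assertion implied by transitivity."""
    closed = set(kb)
    changed = True
    while changed:
        changed = False
        links = [(s, o) for (s, p, o) in closed if p == "is_a"]
        for s1, o1 in links:
            for s2, o2 in links:
                if o1 == s2 and (s1, "is_a", o2) not in closed:
                    closed.add((s1, "is_a", o2))
                    changed = True
    return closed


kb = infer_is_a(ASSERTIONS)
genes = [s for (s, _, _) in query(kb, predicate="is_a", obj="gene")]
```

After inference, the query for everything that is a `gene` also returns `BRCA1` and `TP53`, although neither assertion was entered by a curator. The small ontology (the `is_a` hierarchy) is what lets the system compute across assertions rather than merely retrieve them.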
Leading-edge production knowledgebases currently contain 10^9 to 10^11 assertions mined from a variety of unstructured and semi-structured data sources, including Wikipedia and PubMed (Bordes, et al. 2015; Southern 2015). Prominent commercial examples of knowledgebases include the IBM Watson system and Elsevier Pathway Studio; proprietary examples include WalMart’s Social Genome and Google’s Knowledge Graph; open source systems include OpenBEL, Open PHACTS, and DARPA’s DeepDive. Other examples include the Semantic Web and the Linked Data movement.

The use of knowledgebases in scientific research is growing. For instance, the GenBank knowledgebase allows one to search for specific genes and then identify a wealth of curated information about that gene and related genes and external data sources. The Gene Ontology (GO) knowledgebase provides annotation on genes, biological processes, cellular components, and molecular functions and has been prominent in the application of knowledgebases for new statistical approaches such as enrichment analysis (Mi, et al. 2013). As scientists become more aware of the benefits of knowledgebases, and as the quality and quantity of their assertions increase, their application in scientific discovery is expected to grow.

Major challenges in the development of knowledgebases for scientific discovery include the costs of human curation, in terms of the amount of time and experience required to contribute to a knowledgebase in a meaningful way, and the lack of incentives for scientists to contribute to knowledgebases, especially those perceived to be outside of a scientist’s area of expertise.
As exemplified in the knowledgebase systems listed above, recent research has advanced our understanding of how to construct knowledgebases, including the incorporation of probabilistic models, structured representation of unstructured data, deep-learning systems, question-answer schemas, and natural language processing algorithms. Recent research also has advanced our understanding of how to integrate knowledge from different sources with differing quality and differing semantics (Southern 2015; Nickel, et al. 2015). For example, DeepDive is built upon a probabilistic graph model that allows for the encoding of assertions and relationships between data elements that have inaccuracies, such as those that are derived from predictive models and data-driven approaches, along with assertions that have higher validity. The system uses inference engines to mine the relationships, as well as the strength of the assertions, in order to construct different “world models” upon which new inferences can be made. These advances show promise, but the human user, the scientist, remains the roadblock in the further development and widespread application of knowledgebases for scientific discovery.

The Big Picture: The Future of Knowledge Discovery in Science

We recognize that there have always been scientific approaches that do not rely on the scientific method, as others have argued previously (e.g., Cleland 2001). Yet, the scientific method remains the “gold standard” in terms of scientific discovery. The wealth of data available today offers the unprecedented opportunity to conduct science in ways never before envisioned: real-time science, real-world data, and new methods of scientific discovery. In embracing new approaches to scientific discovery that are not necessarily aligned with the scientific method, or may even contradict it, we must consider how best to employ and unify the new approaches with each other and with the traditional scientific method.

John Tukey, considered by many to be the most influential statistician in modern times, once stated: “The best part of being a statistician is that you get to play in everyone’s backyard.” Among Tukey’s many accomplishments are the introduction of the terms “bit” (in 1946) and “software” (in 1958) and the concepts and methods of exploratory data analysis, data mining, the Tukey Fast Fourier Transformation, and a variety of other statistical and mathematical approaches for scientific discovery, many of them also named after him (Bittrich 2000; Leonhardt 2000; Brillinger 2002). Tukey also introduced the term “uncomfortable science”, which has been described as lying somewhere between classical mathematical statistics and “magical thinking” (Diaconis 2011). The concept is that in many instances, scientific inference can and must be made from intuition and exploration, rather than controlled experimentation, using a finite amount of potentially rich data that are flawed and often nonreplicable.⁸ Tukey died in 2000, before the introduction of the terms “big data” and “Internet of Things”, but one can bet that he’d be the first to take advantage of the many opportunities that the “datafication” of society has provided (Bertolucci 2013).

Another forward-thinking scholar is the late James (Jim) Gray, whose last position was as a Technical Fellow and Manager of the Bay Area Research Center at Microsoft. In addition to his pioneering work on large databases and transaction processing systems, Gray introduced the concept of the Fourth Paradigm for science (Hey, et al. 2009).⁹ While Gray’s initial focus was on exploratory analysis and data-intensive scientific computation, the Fourth Paradigm essentially represents the radical transformation of the scientific method from hypothesis-driven to hypothesis-generating science, as described herein.¹⁰ Gray’s paradigm was developed in the late 1990s and early 2000s, but his genius lies in his vision for today’s world, just a decade or two later, where nearly every scientific domain is moving toward data-intensive and data-enabled scientific discovery. This shift involves traditional fields such as physics and astronomy, which have always had access to rich data sources but are now facing unprecedented computational and analytical challenges, and emergent fields such as environmental science, which has only recently had access to the wealth of distributed data available from environmental sensors and satellites. Even the “soft” sciences such as social science and political science are recognizing the power of data derived from social media, mobile devices, and “smart cities”. Moreover, applied fields such as medicine are embracing the Fourth Paradigm and these myriad new data sources. For example, in 2011, the National Research Council organized an ad hoc committee to develop a framework for a unified taxonomy of human disease that, when implemented, will accelerate progress toward precision medicine, which is a new area of medicine that aims to personalize medicine through the integration of data on individual variability in genes, environment, and lifestyle in order to identify the most appropriate medical monitoring and treatment plan for any given patient (National Research Council 2011). The committee’s work largely motivated President Obama’s January 2015 announcement, during the State of the Union Address, of a new National Initiative in Precision Medicine (Collins & Varmus 2015).

8 “Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” John Tukey, 1962, The future of data analysis. Annals of Mathematical Statistics, 33, 1–67 (quoted on pp. 13-14).

9 After Gray’s death, Alex Szalay and colleagues formally introduced the concept of the “Fourth Paradigm” in a co-authored publication in the journal Science (Bell, et al. 2009).

10 During Gray’s last lecture on January 11, 2007 [http://research.microsoft.com/enus/um/people/gray/jimgraytalks.htm], he is quoted as stating the following: “The world of science has changed, and there is no question about this. The new model is for the data to be captured by instruments or generated by simulations before being processed by software and for the resulting information or knowledge to be stored in computers. Scientists only get to look at their data fairly late in this pipeline. The techniques and technologies for such data-intensive science are so different that it is worth distinguishing data-intensive science from computational science as a new, fourth paradigm for scientific exploration.” (Hey, et al. 2009).

A Framework for Scientific Discovery in the Era of Big Data: The Collaborative Knowledge Network

The time is right to whole-heartedly embrace the flexibility and power that new approaches to scientific discovery offer, while maintaining the power that the traditional scientific method holds.

The scientific approaches presented in this white paper are increasingly being adopted by scientists; however, there remains hesitancy about their roles and legitimacy in the scientific process, a lack of training in and incentives for their use, and the need for additional research to tailor and improve these approaches for scientific discovery. We argue that a primary challenge in moving towards a combination of hypothesis- and data-driven science is the integration of approaches at a community level.

The National Research Council’s framework to create a path toward personalized medicine includes the concept of a Knowledge Network of Disease as a federated discovery network organized around an Information Commons of structured patient-centric data based largely on genomics and including patient medical history, current signs and symptoms, etc. (National Research Council 2011). We suggest a similar, but broader, framework for a Collaborative Knowledge Network for Scientific Discovery (Figure 1). The envisioned network would build upon the knowledgebases that are already being developed, but through an open, community-based effort designed to link these knowledgebases and thereby create a knowledge network for scientific discovery. The effort would involve not only scientific experts, but all stakeholders, including policymakers, industry representatives, and even ordinary citizens, thereby enabling nontraditional data sources and data types to drive scientific discovery. Indeed, scientists today have access to an abundance of new data sources and data types, which we’ve classified as: (1) archetypal or visible big data such as the large, well-curated, public data sets available from NASA and other large-scale research initiatives; (2) crowd-sourced or supernova big data such as the moment-to-moment data sets available from Twitter and the Internet of Things; and (3) long-tail or dark big data such as those hard-to-find data sets held by small scientific teams and amateur or citizen scientists. Efforts such as Wikipedia provide a model for open, collaborative, community-governed knowledge accumulation (Cohen 2014). A community-driven and community-governed knowledge network could revolutionize scientific discovery and move science in directions not imaginable today.

Figure 1. Conceptual framework for the proposed Collaborative Knowledge Network for Scientific Discovery.

Closing Considerations

The Collaborative Knowledge Network for Scientific Discovery, as proposed and envisioned herein, will require new scientific methods, approaches, and policies in order to fully realize its potential. For example, issues related to data provenance will need to be addressed, as will issues related to funding for ongoing development and long-term sustainability. The integration of knowledge assertions generated through different scientific approaches that have differing levels of certainty and expressiveness remains a fundamental challenge, but one where progress is being made. These are not trivial issues, as the creation of new knowledge from collective knowledge, particularly when automated, yields complicated issues regarding intellectual property and ownership. With multiple stakeholders involved in the envisioned knowledge network (e.g., academics, industry, government, the public), ownership issues will need to be resolved. Ongoing development and long-term sustainability bring related challenges that likely will require significant up-front community buy-in. Complex models for provenance and sustainability may need to be developed to account for conflicting interests. Furthermore, today’s students and workers aren’t being trained for the Fourth Paradigm of science and often lack the critical thinking skills required to objectively navigate through the abundance of data available today and make meaningful inferences about the world around them. This gap in training and workforce development will need to be addressed in order to ensure that a knowledge network is not misused.

Perhaps the most important consideration, however, is the development of new approaches to enable the critical evaluation and adjudication of scientific findings in the most traditional sense. For, in the absence of valid automated methods for drawing conclusions regarding scientific “truths,” scientists will continue to face a data deluge without a rational path toward peer consensus, erroneous knowledge may be introduced and propagated, and the lay public will lose trust in science. Indeed, the public already has a tendency to mistrust science, especially when scientific findings go against intuition or personal belief (Achenbach 2015). Without a facile method to determine the legitimacy of scientific findings, any public mistrust in science will only increase. In the absence of public trust, science itself will suffer.

How to reference this paper:
Schmitt, C., Cox, S., Fecho, K., Idaszak, R., Lander, H., Rajasekar, A. and Thakur, S. (2015): Scientific Discovery in the Era of Big Data: More than the Scientific Method. RENCI, University of North Carolina at Chapel Hill. Text. http://dx.doi.org/10.7921/G0C82763

References*

Achenbach, J. (2015). Why do many reasonable people doubt science? National Geographic. http://ngm.nationalgeographic.com/2015/03/science-doubters/achenbach-text.
Achilleos, K.G., Kannas, C.C., Nicolaou, C.A., Pattichis, C.S., & Promponas, V.J. (2012). Open source workflow systems in life sciences informatics. Proceedings of the 2012 IEEE 12th International Conference on Bioinformatics & Bioengineering (BIBE). Larnaca, Cyprus.
Alemida, P. (2013). The origins and purpose of scientific publications. JEPS Bulletin. http://blog.efpsa.org/2013/04/30/the-origins-of-scientific-publishing/.
American Chemical Society Publications. (2013). Be found or perish: writing scientific manuscripts for the digital age. BIOJOURNALS. American Chemical Society. http://pubs.acs.org/bio/ACS-Guide-Writing-Manuscripts-for-the-Digital-Age.pdf.
Behrens, J.T. (1997). Principles and procedures of exploratory data analysis. Psychol Methods., 2(2), 131–160.
Bell, G., Hey, T., & Szalay, A. (2009). Beyond the Data Deluge. Science, 323(5919), 1297-1298.
Berman, H.M., Gabanyi, M.J., Groom, C.R., Johnson, J.E., Murshudov, G.N., Nicholls, R.A., Reddy, V., Schwede, T., Zimmerman, M.D., Westbrook, J., Minor, W. (2015). Data to knowledge: how to get meaning from your result. IUCrJ., 2(Part 1), 45–58.
Bertolucci, J. (2013). Big data’s new buzzword: datafication. InformationWeek. http://www.informationweek.com/big-data/big-data-analytics/big-datas-new-buzzworddatafication/d/d-id/1108797?.
Betz, F. (2011). Origin of Scientific Method. In Managing Science: Methodology and Organization of Research (Innovation, Technology, and Knowledge Management Series, No. 9), by F. Betz. Springer Science+Business Media.
Bittrich, M.E. (2000). Biography of John Wilder Tukey. AT&T Bell Laboratories Communication. http://cm.belllabs.com/cm/stat/tukey/bio.html.
Bordes, A., Weston, J., Collobert, R., Bengio, Y. (2009). Learning structured embeddings of knowledge bases. Association for the Advancement of Artificial Intelligence, 301-306. http://www.thespermwhale.com/jaseweston/papers/AAAI11.pdf.
Bowers, S. (2012). Scientific workflow, provenance, and data modeling challenges and approaches. J. Data Semant., 1(1), 19–30.
Brillinger, D.R. (2002). John W. Tukey: his life and professional contributions. Annals Statist., 30(6), 1535–1575.
Burkhardt, F. (1996). Charles Darwin’s Letters. A Selection. Cambridge University Press.
Buytaert, W., Baez, S., Bustamante, M., Dewulf, A. (2012). Web-based environmental simulation: bridging the gap between scientific modeling and decision-making. Environ Science Technol., 46, 1971–1976.
Cohen, N. (2014). Wikipedia vs. the Small Screen. The New York Times. February 9, 2014. http://www.nytimes.com/2014/02/10/technology/wikipedia-vs-the-small-screen.html?_r=2m.
Cold Spring Harbor Laboratory. (2014). Guide to open access. http://cshl.libguides.com/content.php?pid=222607&sid=1847688.
Collins, F.S., Varmus, H. (2015). A new initiative on precision medicine. NEJM, 372(9), 793–795. http://www.nejm.org/doi/pdf/10.1056/NEJMp1500523.
Curcin, V., Ghanem, M. (2008). Scientific workflow systems – can one size fit all? Proceedings of the 2008 IEEE, CIBEC. IEEE.
Diaconis, P. (2011). Theories of Data Analysis: from magical thinking through classical statistics. In: Exploring Data Tables, Trends, and Shapes, by D.C. Hoaglin, F. Mosteller, and J.W. Tukey. John Wiley & Sons, 2011.
Diemer, J., Alpers, G.W., Peperkorn, H.M., Shiban, Y., Mühlberger, A. (2015). Perception and presence on emotional reactions: a review of research in virtual reality. Front Psychol., 6, Article 26.
Evans, J., Wilhelmsen, K., Berg, J., Schmitt, C., Krishnamurthy, A., Fecho, K., & Ahalt, S. (2015). Clinical genomics: how much data is enough? RENCI, University of North Carolina at Chapel Hill. Text. doi:10.7921/G0F769G9. http://renci.org/wp-content/uploads/2015/02/0215WhitePaper-CLinicalGenomics-highres.pdf.
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, Fall 1996, 37–54.
Gelman, A. (2004). Exploratory data analysis for complex models. J. Computational Graphic Statist., 13(4), 755–779.
Gower, B. (1997). Scientific method: an historical and philosophical introduction. London, England: Routledge.
Guo, P. (2013). Data science workflow: overview and challenges. Communications of the ACM blog. http://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-andchallenges/fulltext.
Hey, T., Tansley, S., Tolle, K. (Eds.) (2009). The Fourth Paradigm. Data-Intensive Scientific Discovery. Seattle, Washington: Microsoft Corporation.
Holzinger, A., Dehmer, M., Jurisica, I. (2014). Knowledge discovery and interactive data mining in bioinformatics – state-of-the-art, future challenges and research directions. BMC Bioinformatics, 15(Suppl 6), 1.
Jarke, M., Neumann, B., Vassiliou, Y., Wahlster, W. (1978). KBMS requirements of knowledge-based systems. In: Schmidt, J.W., & Thanos, C. (eds.) Foundations of Knowledge Base Management. Contributions from Logic, Databases, and Artificial Intelligence. Berlin, Germany: Springer, pp. 391-395. http://www.dfki.de/wwdata/Publications/KBMS_Requirements_of_KnowledgeBased_Systems.pdf.
Joppa, L.N., McInerny, G., Harper, R., Salido, L., Takeda, K., O’Hara, K., Gavaghan, D., Emmott, S. (2013). Troubling trends in scientific software use. Science, 340(6134), 814–815.
Kunkler, K. (2006). The role of medical simulation: an overview. The International Journal of Medical Robotics and Computer Assisted Surgery, 2, 203-210. http://onlinelibrary.wiley.com/doi/10.1002/rcs.101/pdf.
Leonhardt, D. (2000). John Tukey, 85, statistician; coined the word ‘software’. The New York Times. http://www.nytimes.com/2000/07/28/us/john-tukey-85-statistician-coined-the-wordsoftware.html.
Manyika, J., Chu, M., Bisson, P., Woetzel, J., Dobbs, R., Bughin, J., Aharon, D. (2015). The Internet of Things: mapping the value beyond the hype. McKinsey & Company. http://www.mckinsey.com/~/media/McKinsey/dotcom/Insights/Business%20Technology/Unlocking%20the%20potential%20of%20the%20Internet%20of%20Things/Unlocking_the_potential_of_the_Internet_of_Things_Executive_summary.ashx.
McKie, T. (2008). How Darwin won the evolution race. The Guardian. June 21, 2008. http://www.theguardian.com/science/2008/jun/22/darwinbicentenary.evolution.
Mi, H., Muruganujan, A., Casagrande, J.T., Thomas, P.D. (2013). Large-scale gene function analysis with the PANTHER classification system. Nat Protoc., 8(8), 1551-1566.
Montgomery, S. (2009). Charles Darwin & evolution, 1809~2009. Natural selection. Cambridge, UK: Christ’s College, University of Cambridge. http://darwin200.christs.cam.ac.uk/pages/index.php?page_id=d3.
National Research Council. (2011). Toward Precision Medicine. Building a Knowledge Network for Biomedical Research and a New Taxonomy of Disease. Washington, D.C.: The National Academies Press. Contract No. N01-0D-4-2139. http://www.nap.edu/catalog/13284/towardprecision-medicine-building-a-knowledge-network-for-biomedical-research.
Nickel, M., Murphy, K., Tresp, V., Gabrilovich, E. (2015). A review of relational machine learning for knowledge graphs. From multi-relational link prediction to automated knowledge graph construction. Cornell University Library: arXiv:1503.00759. http://arxiv.org/abs/1503.00759.
Perra, N., Gonçalves, B., Pastor-Satorras, R., Vespignani, A. (2012). Activity driven modeling of time varying networks. Scientific Reports, 2, 469, 1.
Perraud, J.M., Bai, Q., Hebir, D. (2010). On the appropriate granularity of activities in a scientific workflow applied to an optimization problem. International Environmental Modelling and Software Society (iEMSs) 2010 International Congress on Environmental Modelling and Software Modelling for Environment’s Sake, Fifth Biennial Meeting. Ottawa, Canada.
Pugh, K., Prusak, L. (2013). Designing effective knowledge networks. MIT Sloan Management Review. September 12, 2013. http://sloanreview.mit.edu/article/designing-effective-knowledgenetworks/.
Reshef, D.N., Reshef, Y.A., Finucane, H.K., Grossman, S.R., McVean, G., Turnbaugh, P.J., Lander, E.S., Mitzenmacher, M., Sabeti, P.C. (2011). Detecting novel associations in large data sets. Science, 334(6062), 1518-1524.
Samuelson, P.A. (1954). The pure theory of public expenditure. The Review of Economics and Statistics, 36(4), 387–389. http://www.ses.unam.mx/docencia/2007II/Lecturas/Mod3_Samuelson.pdf.
Singh, M.P., Vouk, M.A. (undated). Scientific workflows: scientific computing meets transactional workflows. http://www.csc.ncsu.edu/faculty/mpsingh/papers/databases/workflows/sciworkflows.html.
Smith, D.F. (2014). A brief history of gamification. EDTECH, Focus on Higher Education. http://www.edtechmagazine.com/higher/article/2014/07/brief-history-gamificationinfographic.
Smith, R. (2006). Peer review: a flawed process at the heart of science and journals. J Royal Society Med., 99(4), 178–182.
Southern, M. (2015). Google is looking into ways to rank sites based on accuracy of information. Search Engine Journal. March 4, 2015. http://www.searchenginejournal.com/google-is-looking-intoways-to-rank-sites-based-on-accuracy-of-information/127394/#.
Spangler, B. (2003). Adjudication. Beyond Intractability. http://www.beyondintractability.org/essay/adjudication.
Speed, T. (2011). Mathematics. A correlation for the 21st century. Science, 334(6062), 1502-1503.
Trader, T. (2012). Taming the long tail of science. HPCWire. http://www.hpcwire.com/2012/10/15/taming_the_long_tail_of_science/.
Wilhelmsen, K., Schmitt, C. & Fecho, K. (2013). Factors influencing data archival of large-scale genomic data sets: a mathematical formalism to comprehensively evaluate the costs-benefits of archiving large data sets. RENCI Technical Report Series, TR-13-03. University of North Carolina at Chapel Hill, Chapel Hill, NC, USA. doi:10.7921/G0MW2F25. http://renci.org/technical-reports/factors-influencing-data-archival-of-large-scale-genomicdata-sets/.
Zyda, M. (2005). From visual simulation to virtual reality to games. Computer, 38(9), 25–32.

*All hyperlinks were last accessed on September 4, 2015.