Query Processing: A Systems Perspective

advertisement
Query Processing and Networking
Infrastructures
Day 1 of 2
Joe Hellerstein
UC Berkeley
September 20, 2002
Two Goals
Day 1: Primer on query processing


Targeted to networking/OS folk
Bias: systems issues
Day 2: Seed some cross-fertilized research


Especially with networking
Thesis: dataflow convergence


query processing and routing
Clearly other resonances here




Dataflow HW architectures
Event-based systems designs
ML and Control Theory
Online Algorithms
(Sub)Space of Possible Topics
Traditional Relational QP: Optimization & Execution
Distributed & Federated QP
Parallel QP
Adaptive QP
Indexing
Data Reduction
Compression
Boolean Text Search
Visual Querying & Data Visualization
Traditional Text Ranking
Transactional Storage & Networking
Data Model & Query Language Design
Active DBs (Trigger Systems)
NFNF Data Models (OO, XML, “Semistructured”)
Online and Approximate QP
Media Queries, Feature Extraction & Similarity Search
Hypertext Ranking
Data Streams & Continuous Queries
Statistical Data Analysis (“Mining”)
Likely Topics Here
(Same map as above, with the likely subset highlighted.)
Plus Some Speculative Ones
Content Routing
Indirection Architectures
Peer-to-Peer QP
Sensornet QP
Network Monitoring
(Plus the core topics from the map above.)
Outline
Day 1: Query Processing Crash Course





Intro
Queries as indirection
How do relational databases run queries?
How do search engines run queries?
Scaling up: cluster parallelism and distribution
Day 2: Research Synergies w/Networking





Queries as indirection, revisited
Useful (?) analogies to networking research
Some of our recent research at the seams
Some of your research?
Directions and collective discussion
Getting Off on the Right Foot
Roots: database and IR research
“Top-down” traditions (“applications”)

Usually begins with semantics and models.
Common Misconceptions

Query processing = Oracle or Google.
  Need not be so heavyweight or monolithic! Many reusable lessons within
IR search and DB querying are fundamentally different.
  Very similar from a query processing perspective
  Many similarities in other data models as well
Querying is a synchronous, interactive process.
  Triggers, rules and "continuous queries" not so different from plain old queries.
So… we’ll go bottom-up
Focus on reusable building blocks
Attempt to be language- and model-agnostic

illustrate with various querying scenarios
Confession: Two Biases
Relational query engines



Most mature and general query technology
Best documented in the literature
Conceptually general enough to “capture” most other models/schemes
Everybody does web searches


So it’s both an important app, and an inescapable
usage bias we carry around
It will inform our discussion. Shouldn’t skew it
Lots of other query systems/languages you
can keep in mind as we go

LDAP, DNS, XSL/Xpath/XQuery, Datalog
What Are Queries For? I
Obvious answer: search and analysis over big
data sets

Search: select data of interest



Boolean expressions over content
sometimes with an implicit ordering on results
Analysis: construct new information from base
data.




compute functions over each datum
concatenate related records (join)
partition into groups, summarize (aggregates)
aside: “Mining” vs. “Querying”? As a rule of thumb,
think of mining as WYGIWIGY.
Not the most general, powerful answer…
What Are Queries For? II
Queries bridge a (large!) level of indirection



Declarative programming: what you want, not
how to get it
Easy (er) to express
Allows the “how” to change under the covers


A critical issue!
Not just for querying

Method invocation, data update, etc
Motivation for this Indirection
Critical when rates of change differ across
layers:



In particular, when d(app)/dt << d(environment)/dt
E.g. DB apps are used for years, decades (!!)
E.g. networked env: high rates of change (??)
DB lit calls this “data independence”
Data Independence: Background
Bad Old Days


Hierarchical and “Network” (yep!) data models
Nesting & pointers mean that apps explicitly
traverse data, become brittle when data layouts
change
Apps with persistent data have slow dapp/dt

And the database environments change faster!




Logical changes to representation (schema)
Physical changes in storage (indexes, layouts, HW)
DBs often shared by multiple apps!
In B.O.D., all apps had to be rewritten on change
It’s a SW Engineering Thing
Analogy: imagine if your C structs were to
survive for decades


you’d keep them very simple
encapsulation to allow future mods
Similar Analogy to NWs


protocol simplicity is good
soft state is good (discourages hardcoded refs to
transient resources)
But the fun systems part follows directly:

Achieve the goal w/respectable performance
over a dynamic execution environment
Codd’s Data Independence
Ted Codd, IBM c. 1969 and forward

Turing award 1981
Two layers of indirection:

Applications
  | Logical independence: spanned by views and query rewriting
Logical Representation (schema)
  | Physical independence: spanned by query optimization and execution
Physical Representation (storage)
A More Architectural Picture
Declarative query over views
  ↓ Query Rewriter (bridges logical independence)
Declarative query over base tables
  ↓ Optimizer (bridges physical independence)
Query Plan (procedural)
  ↓ Executor
  ↓ Iterator API
Access Methods

N.B.: This classical QP architecture raises some problems. To be revisited!
Access Methods & Indexing
Access Methods
Base data access layer
Model: Data stored in unordered collections

Relations, tables, one type per collection
Interface: iterators

Open(predicate) -> cursor






Usually simple predicates: attribute op constant
op usually arithmetic (<, >, =), though we’ll see extensions (e.g. multi-d ops)
Next(cursor) -> datum (of known type)
Close(cursor)
Insert(datum of correct type)
Delete(cursor)
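The access-method iterator interface above can be sketched in a few lines of Python. Names here (`HeapFile`, `Cursor`) are purely illustrative, not from any real engine:

```python
# Sketch of the access-method interface: Open(predicate) -> cursor,
# Next(cursor) -> datum, Close(cursor). Illustrative names only.

class HeapFile:
    """Unordered collection of records of one type."""
    def __init__(self, records):
        self.records = list(records)

    def open(self, predicate=lambda r: True):
        """Open(predicate) -> cursor; predicate is typically attr-op-constant."""
        return Cursor(self.records, predicate)

    def insert(self, record):
        self.records.append(record)

class Cursor:
    def __init__(self, records, predicate):
        self._it = (r for r in records if predicate(r))
        self.closed = False

    def next(self):
        """Next(cursor) -> datum, or None when exhausted."""
        return next(self._it, None)

    def close(self):
        self.closed = True

emp = HeapFile([{"name": "a", "sal": 10}, {"name": "b", "sal": 99}])
cur = emp.open(lambda r: r["sal"] > 50)   # simple attribute-op-constant predicate
row = cur.next()                          # first (and only) match
cur.close()
```

For a heap file the predicate just saves cross-layer costs; an index AM would use it to prune I/Os.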
Typical Access Methods
“Heap” files



unordered array of records
usually sequential on disk
predicates just save cross-layer costs
Traditional Index AMs

B-trees

actually, “B+”-trees: all data at leaves




Can scan across leaves for range search
predicates (<, >, =, between) result in fewer I/Os
  but random I/Os (at least to find beginning of range)
Linear Hash index

Litwin ‘78. Supports equality predicates only.
This is it for IR and standard relational DBs

Though when IR folks say “indexing”, they sometimes mean
all of query processing
Primary & Secondary Indexes
Directory
Directory
Data
(key, ptr) pairs
Data
Primary & Secondary Indexes
Directory
Directory
Data
(key, ptr) pairs
Data
Directory
An Exotic Forest of Search Trees
Multi-dimensional indexes



For geodata, multimedia search, etc.
Dozens! E.g. R-tree family, disk-based QuadTrees, kdB-trees
And of course “linearizations” with B-trees
Path indexes


For XML and OO path queries
E.g. Xfilter
Etc.



Lots of one-off indexes, often many per workload
No clear winners here
Extensible indexing scheme would be nice
Generalized Search Trees
(GiST)
[Hellerstein et al., VLDB 95]
What is a (tree-based) DB index?
Typically:
A clustering of data into leaf blocks
 Hierarchical summaries
(subtree predicates -- SPs)
for pointers in directory blocks

(Diagram: a directory block with SP entries p1, p2, p3, … over subtrees.)
Generalized Search Trees (GiST)
Can realize that abstraction with simple interface:

User registers opaque SP objects with a few methods




Consistent(q, p): should query q traverse subtree?
Penalty(d, p): how bad is it to insert d below p
Union (p1, p2): form SP that includes p1, p2
PickSplit({p1, …, pn}): partition SPs into 2
Tree maintenance, concurrency, recovery all doable
under the covers
Covers many popular multi-dimensional indexes

Most of which had no concurrency/recovery story
http://gist.cs.berkeley.edu
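The GiST abstraction can be sketched for the simplest possible key type, 1-D intervals. This is a toy illustration of the four registered methods, not the real libgist API:

```python
# Minimal sketch of the GiST idea with 1-D interval subtree predicates (SPs).
# Hypothetical classes; a real GiST also handles insert, split, paging, etc.

class IntervalSP:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def consistent(self, q):
        """Consistent(q, p): should a query interval q traverse this subtree?"""
        return not (q[1] < self.lo or q[0] > self.hi)

    def union(self, other):
        """Union(p1, p2): form an SP that includes both."""
        return IntervalSP(min(self.lo, other.lo), max(self.hi, other.hi))

    def penalty(self, key):
        """Penalty(d, p): how much inserting key below p would enlarge it."""
        return max(0, self.lo - key) + max(0, key - self.hi)

class Node:
    def __init__(self, sp, children=None, data=None):
        self.sp, self.children, self.data = sp, children or [], data

def search(node, q):
    """Traverse only subtrees whose SP is consistent with q."""
    if node.data is not None:                       # leaf block
        return [d for d in node.data if q[0] <= d <= q[1]]
    out = []
    for child in node.children:
        if child.sp.consistent(q):
            out.extend(search(child, q))
    return out

leaf1 = Node(IntervalSP(1, 5), data=[1, 3, 5])
leaf2 = Node(IntervalSP(8, 12), data=[8, 9, 12])
root = Node(leaf1.sp.union(leaf2.sp), children=[leaf1, leaf2])
search(root, (4, 9))   # -> [5, 8, 9]
```

Swapping in bounding boxes for intervals gives an R-tree; swapping in set-containment predicates gives an RD-tree, all over the same search code.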
Some Additional Indexing Tricks
[O’Neil/Quass, SIGMOD 97]
Bitmap indexing

Many matches per value in (secondary) index?
Rather than storing pointers to heap file in
leaves, store a bitmap of matches in a (sorted)
heap file.




Only works if file reorg is infrequent
Can make intersection, COUNT, etc. quicker during
query processing
Can mix/match bitmaps and lists in a single index
Works with any (secondary) index with duplicate
matches
“Vertical Partitioning” / “Columnar storage”

Again, for sorted, relatively static files
Query Processing Dataflow
Infrastructures
Dataflow Infrastructure
Dataflow abstraction is very simple


“box-and-arrow” diagrams
(typed) collections of objects flow along edges
Details can be tricky

“Push” or “Pull”?



More to it than that
How do control-flow and
dataflow interact?
Where does the data live?


Don’t want to copy data
If passing pointers, where does
the “real” data live?
Iterators
Most uniprocessor DB engines use iterators



Open() -> cursor
Next(cursor) -> typed record
Close(cursor)
Simple and elegant


Control-flow and dataflow coupled
Familiar single-threaded, procedure-call API

Data refs passed on stack, no buffering
Blocking-agnostic


Works w/blocking ops -- e.g. Sort
Works w/pipelined ops
Note: well-behaved iterators “come up for air” in inner loops
  E.g. for interrupt handling
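The pull-based iterator model composes operators through one procedure-call API. A minimal sketch, with illustrative operator names:

```python
# Pull-based ("iterator model") operators: each implements open/next/close
# and pulls from its child. Illustrative sketch only.

class Scan:
    def __init__(self, rows): self.rows = rows
    def open(self): self.i = 0
    def next(self):
        if self.i >= len(self.rows):
            return None                    # end of stream
        row = self.rows[self.i]; self.i += 1
        return row
    def close(self): pass

class Filter:
    def __init__(self, child, pred): self.child, self.pred = child, pred
    def open(self): self.child.open()
    def next(self):
        # Pull from the child until a match or end of stream: control flow
        # and dataflow are coupled in a single call stack.
        while (row := self.child.next()) is not None:
            if self.pred(row):
                return row
        return None
    def close(self): self.child.close()

plan = Filter(Scan([1, 2, 3, 4, 5]), lambda x: x % 2 == 0)
plan.open()
out = []
while (r := plan.next()) is not None:
    out.append(r)
plan.close()   # out == [2, 4]
```

Note how nothing runs until the consumer calls `next()`: a blocking operator (e.g. Sort) simply consumes its whole input inside its first `next()`.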
Where is the In-Flight Data?
In standard DBMS, raw data lives in disk format, in
shared Buffer Pool
Iterators pass references to BufPool



A tuple “slot” per iterator input
Never copy along edges of dataflow
Join results are arrays of refs to base tables
Operators may “pin” pages in BufPool




BufPool never replaces pinned pages
Ops should release pins ASAP (esp. across Next() calls!!)
Some operators copy data into their internal state
Can “spill” this state to private disk space
Weaknesses of Simple Iterators
Evolution of uniprocessor archs to parallel archs

esp. “shared-nothing” clusters
Opportunity for pipelined parallelism
Opportunity for partition parallelism

Take a single “box” in the dataflow, and split it across
multiple machines
Problems with iterators in this environment


Spoils pipelined parallelism opportunity
Polling (Next()) across the network is inefficient


Nodes sit idle until polled, and during comm
A blocking producer blocks its consumer
But would like to keep iterator abstraction


Especially to save legacy query processor code
And simplify debugging (single-threaded, synchronous)
Exchange
[Graefe, SIGMOD 90]
Encapsulate partition parallelism &
asynchrony


Keep the iterator API between ops
Exchange operator partitions input data by
content

E.g. join or sort keys
Note basic architectural idea!


Encapsulate dataflow
tricks in operators, leaving
infrastructure untouched
We’ll see this again next
week, e.g. in Eddies
Exchange Internals
Really 2 operators, XIN and XOUT
  XIN is “top” of a plan, and pulls, pushing results to XOUT’s queue
  XOUT spins on its local queue
One thread becomes two
  Producer graph & XIN
  Consumer graph & XOUT
Routing table/fn in XIN supports partition parallelism
  E.g. for parallel sort, join, etc.
Producer and consumer see iterator API
Queue + thread barrier turns NW-based “push” into iterator-style “pull”
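The XIN/XOUT split can be sketched with a thread and queues. This is an illustrative toy, not Volcano's actual code:

```python
# Sketch of Exchange: a producer thread (XIN) pulls from its input and pushes
# rows into per-partition queues; each consumer (XOut) pulls from its queue
# through the ordinary iterator API. Illustrative only.
import queue
import threading

SENTINEL = object()   # marks end of stream

def xin(rows, queues, route):
    """'Top' of the producer plan: pull from input, push to the routed queue."""
    for row in rows:
        queues[route(row)].put(row)       # content-based routing table/fn
    for q in queues:
        q.put(SENTINEL)

class XOut:
    """Consumer side: iterator API over a local queue (push turned into pull)."""
    def __init__(self, q): self.q = q
    def next(self):
        row = self.q.get()                # blocks if the queue is empty
        return None if row is SENTINEL else row

queues = [queue.Queue(), queue.Queue()]
t = threading.Thread(target=xin, args=(range(10), queues, lambda r: r % 2))
t.start()

partition0 = []
out = XOut(queues[0])
while (r := out.next()) is not None:
    partition0.append(r)
t.join()   # partition0 == [0, 2, 4, 6, 8]
```

The blocking `q.get()` is exactly the limitation noted below: if the producer stalls, the consumer graph stalls with it.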
Exchange Benefits?
Remember Iterator limitations?
“Spoils pipelined parallelism opportunity”
  Solved by Exchange thread boundary
“Polling (Next()) across the network is inefficient”
  Solved by XIN pushing to XOUT’s queue
“A blocking producer blocks its consumer”
  Still a problem!
Exchange Limitations
Doesn’t allow consumer work to overlap
w/blocking producers



E.g. streaming data sources, events
E.g. sort, some join algs
Entire consumer graph blocks if XOUT queue
empty


Control flow coupled to dataflow, so XOUT won’t return
without data
Queue is encapsulated from consumer
But …


Note that exchange model is fine for most
traditional DB Query Processing
May need to be extended for new settings…
Fjords
[Madden/Franklin, ICDE 01]
Thread of control per operator
Queues between each operator
Asynch or synch calls


Can do asynch poll-and-yield iteration in each operator (for
both consumer and producer)
Or can do synchronous get_next iteration
Can get traditional behavior if you want:
  Synch polls + queue of size 1 = Iterators
  Synch consumer, asynch producer = Exchange
Asynch calls solve the blocking problem of Exchange
Fjords
Disadvantages:

Lots of “threads”




Best done in an event-programming style, not OS threads
Operators really have to “come up for air” (“yield”)
Need to write your own scheduler
Harder to debug
But:


Maximizes flexibility for operators at the
endpoints
Still provides a fairly simple interface for
operator-writers
Basic Relational Operators and
Implementation
Relational Algebra Semantics
Selection: σp(R)
  Returns all rows in R that satisfy p
Projection: πC(R)
  Returns all rows in R projected to columns in C
  In strict relational model, remove duplicate rows
  In SQL, preserve duplicates (multiset semantics)
Cartesian Product: R × S
Union: R ∪ S
Difference: R − S
  Note: R, S must have matching schemata
Join: R ⋈p S = σp(R × S)
Missing: Grouping & Aggregation, Sorting
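The algebra above can be sketched over lists of dicts, using SQL multiset semantics (projection keeps duplicates). Illustrative only:

```python
# The relational operators above, sketched over lists of dicts.

def select(rows, p):                 # sigma_p(R)
    return [r for r in rows if p(r)]

def project(rows, cols):             # pi_C(R), duplicates preserved (SQL-style)
    return [{c: r[c] for c in cols} for r in rows]

def product(R, S):                   # R x S
    return [{**r, **s} for r in R for s in S]

def join(R, S, p):                   # R join_p S = sigma_p(R x S)
    return select(product(R, S), p)

R = [{"a": 1}, {"a": 2}]
S = [{"b": 1}, {"b": 3}]
join(R, S, lambda t: t["a"] == t["b"])   # -> [{"a": 1, "b": 1}]
```

Of course no engine evaluates a join this way; the definition via Cartesian product is the semantics, and the algorithms below are the implementations.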
Operator Overview: Basics
Selection


Typically “free”, so “pushed down”
Often omitted from diagrams
Projection

In SQL, typically “free”, so “pushed down”



No duplicate elimination
Always pass the minimal set of columns downstream
Typically omitted from diagrams
Cartesian Product

Unavoidable nested loop to generate output
Union:

Concat, or concat followed by dup. elim.
Operator Overview, Cont.
Unary operators: Grouping & Sorting

Grouping can be done with hash or sort schemes
(as we’ll see)
Binary matching: Joins/Intersections

Alternative algorithms:




Nested loops
Loop with index lookup (Index N.L.)
Sort-merge
Hash Join
Don’t forget: have to write as iterators

Every time you get called with Next(), you adjust
your state and produce an output record
Unary External Hashing
[Bratbergsengen, VLDB 84]
E.g. GROUP BY, DISTINCT
Two hash functions, hc (coarse) and hf (fine)
Two phases:

Phase 1: for each tuple of input, hash via hc into a “spill”
partition to be put on disk

B-1 blocks of memory used to hold output buffers for writing a
block at a time per partition
(Diagram: the input relation streams through hash function hc into B−1 output buffers, one per spill partition, using B main-memory buffers; partitions are written to disk.)
Unary External Hashing

Phase 2: for each partition, read off disk and hash into a
main-memory hashtable via hf


For distinct, when you find a value already in hashtable,
discard the copy
For GROUP BY, associate some agg state (e.g. running SUM)
with each group in the hash table, and maintain
(Diagram: each spill partition is read back from disk and hashed via hf into an in-memory hash table of k < B pages; results stream to an output buffer.)
External Hashing: Analysis
To utilize memory well in Phase 2, would like
each partition to be ~ B blocks big

Hence works in two phases when B >= |R|


Same req as external sorting!
Else can recursively partition the partitions in
Phase 2
Can be made to pipeline, to adapt nicely to
small data sets, etc.
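The two-phase scheme can be sketched directly. Here the "spill partitions" are just in-memory lists standing in for disk files, and the aggregate is a running SUM; a hypothetical sketch, not production code:

```python
# Two-phase unary external hashing for GROUP BY: a coarse hash h_c spills the
# input into partitions (lists standing in for disk files), then a fine hash
# h_f builds a per-partition in-memory table of aggregate state.

def external_group_by(tuples, num_partitions, key, val):
    # Phase 1: partition by coarse hash h_c; each partition would be a spill
    # file written a block at a time in a real engine.
    partitions = [[] for _ in range(num_partitions)]
    for t in tuples:
        partitions[hash(t[key]) % num_partitions].append(t)

    # Phase 2: per partition, hash by h_f into a memory-resident table that
    # holds the running aggregate (here, SUM) per group.
    result = {}
    for part in partitions:
        table = {}                    # fits in memory if partition ~ B blocks
        for t in part:
            table[t[key]] = table.get(t[key], 0) + t[val]
        result.update(table)          # partitions have disjoint key sets
    return result

rows = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]
external_group_by(rows, 2, key=0, val=1)   # -> {"a": 4, "b": 6}
```

For DISTINCT, the table would hold bare values and drop duplicates instead of maintaining aggregate state.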
Hash Join (GRACE)[Fushimi, et al., VLDB 84]
Phase 1: partition each relation on the join key with hc, spilling
to disk
Phase 2:


build each partition of smaller relation into a hashtable via hf
scan matching partition of bigger relation, and for each tuple
probe the hashtable via hf for matches
Would like each partition of smaller relation to fit in memory


So works well if B >= |smaller|
Size of bigger is irrelevant!! (Vs. sort-merge join)
Popular optimization: Hybrid hash join



Partition #0 doesn’t spill -- it builds and probes immediately
Partitions 1 through n use rest of memory for output buffers
[DeWitt/Katz/Olken/Shapiro/Stonebraker/Wood, SIGMOD 84]
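The two GRACE phases can be sketched as follows; partitions again stand in for spill files, and hybrid hashing's resident partition #0 is omitted for brevity:

```python
# GRACE hash join sketch: Phase 1 partitions both inputs on the join key via
# h_c; Phase 2 builds a table per partition of one side (here R, assumed the
# smaller) via h_f and probes it with the matching partition of the other.

def grace_hash_join(R, S, rkey, skey, n_parts=4):
    def partition(rows, key):
        parts = [[] for _ in range(n_parts)]
        for row in rows:
            parts[hash(row[key]) % n_parts].append(row)   # h_c
        return parts

    r_parts, s_parts = partition(R, rkey), partition(S, skey)
    out = []
    for rp, sp in zip(r_parts, s_parts):
        table = {}                        # build side, hashed via h_f
        for r in rp:
            table.setdefault(r[rkey], []).append(r)
        for s in sp:                      # probe side: size is irrelevant
            for r in table.get(s[skey], []):
                out.append((r, s))
    return out

R = [(1, "x"), (2, "y")]
S = [(2, "p"), (3, "q"), (2, "r")]
grace_hash_join(R, S, rkey=0, skey=0)
# -> [((2, "y"), (2, "p")), ((2, "y"), (2, "r"))]
```

Only matching partitions are ever compared, which is why the size of the probe side does not constrain memory.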
Hash-Join
Original
Relations
OUTPUT
1
Partitions
1
2
INPUT
2
hash
function
...
hc
B-1
B-1
Disk
B main memory buffers
Partitions
of R & S
hash
fn
Disk
Hash table for partition
Ri (k < B-1 pages)
hf
hf
Input buffer
for Si
Disk
Join Result
Output
buffer
B main memory buffers
Symmetric Hash Join
[Mikkilineni & Su, TOSE 88]
[Wilschut & Apers, PDIS 91]
Pipelining, in-core variant
Build and probe symmetrically

Correctness: Each output tuple
generated when its last-arriving
component appears
Can be extended to out-of-core case
  Tukwila [Ives & Halevy, SIGMOD ‘99]
  XJoin [Urhan & Franklin, DEBull ‘00]: spill and read partitions multiple times
    Correctness guaranteed by timestamping tuples and partitions
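The in-core symmetric variant is short enough to sketch whole. A hypothetical illustration of the build-and-probe-symmetrically idea:

```python
# Symmetric (pipelining) hash join sketch: keep a hash table per input; on
# each arrival, insert the tuple into its own table and probe the other, so
# every output appears as soon as its last-arriving component does.

def symmetric_hash_join(stream, key=0):
    """stream yields ('R', tuple) or ('S', tuple) in arrival order."""
    tables = {"R": {}, "S": {}}
    out = []
    for side, tup in stream:
        other = "S" if side == "R" else "R"
        tables[side].setdefault(tup[key], []).append(tup)   # build own side
        for match in tables[other].get(tup[key], []):       # probe other side
            pair = (tup, match) if side == "R" else (match, tup)
            out.append(pair)
    return out

arrivals = [("R", (1, "a")), ("S", (1, "x")), ("S", (2, "y")), ("R", (2, "b"))]
symmetric_hash_join(arrivals)
# -> [((1, "a"), (1, "x")), ((2, "b"), (2, "y"))]
```

Because neither input blocks the other, this works over streams; the out-of-core extensions cited above add spilling on top of the same idea.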
Relational Query Engines
A Basic SQL primer
SELECT [DISTINCT] <output expressions>
FROM <tables>
[WHERE <predicates>]
[GROUP BY <gb-expression>
[HAVING <h-predicates>]]
[ORDER BY <expression>]

Join tables in FROM clause
  applying predicates in WHERE clause
If GROUP BY, partition results by GROUP
  And maintain aggregate output expressions per group
  Delete groups that don’t satisfy HAVING clause
If ORDER BY, sort output accordingly
Examples
Single-table S-F-W

DISTINCT, ORDER BY
Multi-table S-F-W

And self-join
Scalar output expressions
Aggregate output expressions

With and without DISTINCT
Group By
Having
Nested queries

Uncorrelated and correlated
A Dopey Query Optimizer
For each S-F-W query block
  Create a plan that:
    Forms the cartesian product of the FROM clause tables
    Applies the WHERE clause predicates (σ)
  Then, as needed:
    Apply the GROUP BY clause
    Apply the HAVING clause
    Apply any projections and output expressions
    Apply duplicate elimination and/or ORDER BY
Incredibly inefficient
  Huge intermediate results!
An Oracular Query Optimizer
For each possible correct plan:
Run the plan (infinitely fast)
 Measure its performance in reality

Pick the best plan, and run it in reality
A Standard Query Optimizer
Three aspects to the problem
Legal plan space (transformation rules)
 Cost model
 Search Strategy

Plan Space
Many legal algebraic transformations, e.g.:


Cartesian product followed by selection can be rewritten as
join
Join is commutative and associative




Can reorder the join tree arbitrarily
NP-hard to find best join tree in general
Selections should (usually) be “pushed down”
Projections can be “pushed down”
And “physical” choices



Choice of Access Methods
Choice of Join algorithms
Taking advantage of sorted nature of some streams

Complicates Dynamic Programming, as we’ll see
Cost Model & Selectivity
Estimation
Cost of a physical operator can be modeled fairly
accurately:


E.g. number of random and sequential I/Os
Requires metadata about input tables:


Number of rows (cardinality)
Bytes per tuple (physical schema)
In a query pipeline, metadata on intermediate
tables is trickier


Cardinality?
Requires “selectivity” (COUNT) estimation



Wet-finger estimates
Histograms, joint distributions and other summaries
Sampling
Search Strategy
Dynamic Programming


Used in most commercial systems
IBM’s System R [Selinger, et al. SIGMOD 79]
Top-Down



Branch and bound with memoization
Exodus, Volcano & Cascades [Graefe, SIGMOD 87,
ICDE 93, DEBull 95]
Used in a few commercial systems (Microsoft SQL
Server, especially)
Randomized

Simulated Annealing, etc. [Ioannidis & Kang
SIGMOD 90]
Dynamic Programming
Use principle of optimality

Any subtree of the optimal plan is itself optimal for its subexpression
Plans enumerated in N passes (if N relations joined):




Pass 1: Find best 1-relation plan for each relation.
Pass 2: Find best way to join result of each 1-relation plan
(as outer) to another relation. (All 2-relation plans.)
Pass N: Find best way to join result of a (N-1)-relation plan
(as outer) to the N’th relation. (All N-relation plans.)
This gives all left-deep plans. Generalization is easy…
A wrinkle: physical properties (e.g. sort orders)
violate principle of optimality!

Use partial-order dynamic programming

I.e. keep undominated plans at each step -- optimal for each
setting of the physical properties (each “interesting order”)
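The N-pass left-deep enumeration can be sketched with a toy cost model. Here cost of a join is simply the product of input cardinalities; real optimizers plug in selectivity estimates and physical-property handling, which this sketch omits:

```python
# Left-deep dynamic programming in the System R style: pass k finds the best
# plan for each k-relation subset by extending the best (k-1)-relation plans.
# Toy cost model (no selectivities): join cost = product of input sizes.
from itertools import combinations

def best_left_deep_plan(cards):
    """cards: {relation_name: cardinality}. Returns (join_order, cost)."""
    rels = frozenset(cards)
    # best[subset] = (plan as a tuple, cost so far, output cardinality)
    best = {frozenset([r]): ((r,), 0, cards[r]) for r in rels}

    for k in range(2, len(rels) + 1):
        for subset in map(frozenset, combinations(rels, k)):
            candidates = []
            for r in subset:                   # r joins last (innermost input)
                plan, cost, size = best[subset - {r}]
                join_cost = cost + size * cards[r]
                candidates.append((plan + (r,), join_cost, size * cards[r]))
            best[subset] = min(candidates, key=lambda c: c[1])

    plan, cost, _ = best[rels]
    return plan, cost

plan, cost = best_left_deep_plan({"A": 10, "B": 1000, "C": 2})
# cost == 20020; the big relation B is joined last
```

Partial-order DP for "interesting orders" would keep several undominated entries per subset instead of a single `min`.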
Relational Architecture Review
Query Parsing
and Optimization
Query Executor
Lock Manager
Access Methods
Log Manager
Buffer Management
Disk Space Management
DB
Text Search
Information Retrieval
A research field traditionally separate from
Databases



Goes back to IBM, Rand and Lockheed in the 50’s
G. Salton at Cornell in the 60’s
Lots of research since then
Products traditionally separate


Originally, document management systems for libraries,
government, law, etc.
Gained prominence in recent years due to web search
Today: simple IR techniques

Show similarities to DBMS techniques you already know
IR vs. DBMS
Seem like very different beasts
IR                                    DBMS
Imprecise semantics                   Precise semantics
Keyword search                        SQL
Unstructured data format              Structured data
Read-mostly; add docs occasionally    Expect reasonable number of updates
Page through top k results            Generate full answer
Under the hood, not as different as they might seem

But in practice, you have to choose between the 2
IR’s “Bag of Words” Model
Typical IR data model:

Each document is just a bag of words (“terms”)
Detail 1: “Stop Words”



Certain words are considered irrelevant and not placed in the bag
e.g. “the”
e.g. HTML tags like <H1>
Detail 2: “Stemming”


Using English-specific rules, convert words to their basic form
e.g. “surfing”, “surfed” --> “surf”
Detail 3: we may decorate the words with other attributes


E.g. position, font info, etc.
Not exactly “bag of words” after all
Boolean Text Search
Find all documents that match a Boolean
containment expression:

“Windows”
AND (“Glass” OR “Door”)
AND NOT “Microsoft”
Note: query terms are also filtered via
stemming and stop words
When web search engines say “10,000
documents found”, that’s the Boolean search
result size.
Text “Indexes”
When IR folks say “index” or “indexing” …

Usually mean more than what DB people mean
In our terms, both “tables” and indexes



Really a logical schema (i.e. tables)
With a physical schema (i.e. indexes)
Usually not stored in a DBMS

Tables implemented as files in a file system
A Simple Relational Text Index
Create and populate a table
InvertedFile(term string, docID int64)
Build a B+-tree or Hash index on InvertedFile.term


May be lots of duplicate docIDs per term
Secondary index: list compression per term possible
This is often called an “inverted file” or “inverted
index”

Maps from words -> docs

whereas normal files map docs to the words in the doc (?!)
Can now do single-word text search queries
Handling Boolean Logic
How to do “term1” OR “term2”?

Union of two docID sets
How to do “term1” AND “term2”?

Intersection (ID join) of two DocID sets!
How to do “term1” AND NOT “term2”?

Set subtraction

Also a join algorithm
How to do “term1” OR NOT “term2”

Union of “term1” and “NOT term2”.


“Not term2” = all docs not containing term2. Yuck!
Usually forbidden at UI/parser
Refinement: what order to handle terms if you have
many ANDs/NOTs?
“Windows” AND (“Glass” OR “Door”)
AND NOT “Microsoft”
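These set operations over posting lists are a few lines of Python on a toy inverted file (illustrative data, no stemming or stop words):

```python
# Boolean search over a toy inverted file, using set operations as above:
# OR = union, AND = intersection, AND NOT = set subtraction.

docs = {
    1: "windows glass door",
    2: "microsoft windows glass",
    3: "windows door",
}

inverted = {}                          # term -> set of docIDs
for doc_id, text in docs.items():
    for term in text.split():
        inverted.setdefault(term, set()).add(doc_id)

def posting(term):
    return inverted.get(term, set())

# "windows" AND ("glass" OR "door") AND NOT "microsoft"
result = (posting("windows")
          & (posting("glass") | posting("door"))) - posting("microsoft")
# result == {1, 3}
```

Handling the many-ANDs ordering question amounts to intersecting the shortest posting lists first.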
Boolean Search in SQL
(SELECT docID FROM InvertedFile
WHERE word = “window”
INTERSECT
SELECT docID FROM InvertedFile
WHERE word = “glass” OR word = “door”)
EXCEPT
SELECT docID FROM InvertedFile
WHERE word=“Microsoft”
ORDER BY magic_rank()
Really there’s only one query (template) in IR



Single-table selects, UNION, INTERSECT, EXCEPT
Note that INTERSECT is a shorthand for equijoin on a key
Often there’s only one query plan in the system, too!
magic_rank() is the “secret sauce” in the search engines
Fancier: Phrases and “Near”
Suppose you want a phrase

E.g. “Happy Days”
Add a position attribute to the schema:
  InvertedFile (term string, docID int64, position int)
  Index on term
Enhance join condition in query
  Can’t use INTERSECT syntax, but query is nearly the same

SELECT I1.docID
FROM InvertedFile I1, InvertedFile I2
WHERE I1.word = “HAPPY”
AND I2.word = “DAYS”
AND I1.docID = I2.docID
AND I2.position - I1.position = 1
ORDER BY magic_rank()

Can relax to “term1” NEAR “term2”
  |I2.position − I1.position| < k
Classical Document Ranking
TF × IDF (Term Frequency × Inverse Document Frequency)
For each term t in the query:
  QueryTermRank = (#occurrences of t in q)              [TF]
                  × log((total #docs)/(#docs with t))   [IDF]
                  × normalization-factor
For each doc d in the boolean result:
  DocTermRank = (#occurrences of t in d)                [TF]
                × log((total #docs)/(#docs with t))     [IDF]
                × normalization-factor
  Rank += DocTermRank × QueryTermRank
Requires more to our schema
  InvertedFile (term string, docID int64, position int, DocTermRank float)
  TermInfo (term string, numDocs int)
  Can compress DocTermRank non-relationally
This basically works fine for raw text
  There are other schemes, but this is the standard
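The formulas above translate almost line for line into code. A toy sketch over three tiny documents, with the normalization factors omitted for brevity:

```python
# Classical TF x IDF ranking, straight from the formulas above:
# term rank = tf * log(N / df), combined multiplicatively for query and doc.
import math

docs = {
    1: "happy days happy".split(),
    2: "days of our lives".split(),
    3: "happy feet".split(),
}
N = len(docs)
df = {}                                   # document frequency per term
for words in docs.values():
    for t in set(words):
        df[t] = df.get(t, 0) + 1

def rank(query, doc_words):
    score = 0.0
    for t in set(query):
        if t not in df or t not in doc_words:
            continue
        idf = math.log(N / df[t])
        query_term_rank = query.count(t) * idf    # TF x IDF on the query
        doc_term_rank = doc_words.count(t) * idf  # TF x IDF on the doc
        score += doc_term_rank * query_term_rank
    return score

q = "happy days".split()
ranked = sorted(docs, key=lambda d: rank(q, docs[d]), reverse=True)
# doc 1 ranks first: it matches both terms, with "happy" appearing twice
```

In a real engine the per-document DocTermRank is precomputed and stored in the inverted file, so ranking is a scan-and-accumulate over posting lists.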
Some Additional Ranking Tricks
Phrases/Proximity

Ranking function can incorporate position
Query expansion, suggestions

Can keep a similarity matrix on terms, and
expand/modify people’s queries
Document expansion

Can add terms to a doc

E.g. in “anchor text” of refs to the doc
Not all occurrences are created equal

Mess with DocTermRank based on:

Fonts, position in doc (title, etc.)
Hypertext Ranking
Also factor in graph structure




Social Network Theory (Citation Analysis)
“Hubs and Authorities” (Clever), “PageRank”
(Google)
Intuition: recursively weighted in-degrees, outdegrees
Math: eigenvector computation
PageRank sure seems to help

Though word on the street is that other factors
matter as much

Anchor text, title/bold text, etc.
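The "recursively weighted in-degrees" intuition can be sketched as a power iteration in the PageRank style; a toy illustration with a hypothetical three-page graph, not Google's algorithm as deployed:

```python
# Power-iteration sketch of PageRank-style ranking: rank flows along
# out-links, damped by d; this approximates the dominant eigenvector of the
# link matrix.

def pagerank(links, d=0.85, iters=50):
    """links: {page: [pages it points to]}. Returns {page: rank}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            if not outs:                        # dangling page: spread evenly
                for q in pages:
                    new[q] += d * rank[p] / n
            else:
                for q in outs:
                    new[q] += d * rank[p] / len(outs)
        rank = new
    return rank

links = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
r = pagerank(links)
# "a" ends up highest: everything points at it, directly or via b
```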
Updates and Text Search
Text search engines are designed to be query-mostly


Deletes and modifications are rare
Can postpone updates (nobody notices, no transactions!)


Can’t afford to go offline for an update?





Updates done in batch (rebuild the index)
Create a 2nd index on a separate machine
Replace the 1st index with the 2nd
Can do this incrementally with a level of indirection
So no concurrency control problems
Can compress to search-friendly, update-unfriendly format
For these reasons, text search engines and DBMSs
are usually separate products


Also, text-search engines tune that one SQL query to death!
The benefits of a special-case workload.
Architectural Comparison
DBMS stack (concurrency and recovery needed throughout):
  Query Optimization and Execution
  Relational Operators
  Files and Access Methods
  Buffer Management
  Disk Space Management
  DB

Search Engine stack (a “simple DBMS”):
  Search String Modifier   }
  Ranking Algorithm        } “The Query”
  The Access Method
  Buffer Management (OS)
  Disk Space Management (OS)
  DB
Revisiting Our IR/DBMS
Distinctions
Data Modeling & Query Complexity

DBMS supports any schema & queries



Requires you to define schema
Complex query language (hard for folks to learn)
Multiple applications at output


RowSet API (cursors)
IR supports only one schema & query



No schema design required (unstructured text)
Trivial query language
Single application behavior

Page through output in rank order, ignore most of output
Storage Semantics


DBMS: online, transactional storage
IR: batch, unreliable storage
Distribution & Parallelism
Roots
Distributed QP vs. Parallel QP

Distributed QP envisioned as a k-node intranet
for k ~= 10


Sound old-fashioned? Think of multiple hosting sites
(e.g. one per continent)
Parallel QP grew out of DB Machine research

All in one room, one administrator
Parallel DBMS architecture options



Shared-Nothing
Shared-Everything
Shared-Disk

Shared-nothing is most general, most scalable
Distributed QP: Semi-Joins
[Bernstein/Goodman ‘79]
Main query processing issue in distributed DB lit: use semi-joins
  Semi-join: R ⋉ S = πR(R ⋈ S)
  Observe that R ⋈ S = (R ⋉ π(S)) ⋈ S
Assume each table lives at one site, R is bigger. To reduce communication:
  Ship S’s join columns to R’s site, do semi-join there, ship result to S’s site for the join
Notes


I’m sloppy about dups in my def’ns above
Semi-joins aren’t always a win

Extra cost estimation task for a distributed optimizer
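The shipping strategy can be sketched in a few lines; here "shipping" is just a variable handoff, and all names are illustrative:

```python
# Semi-join strategy sketch: ship only S's join-column values to R's site,
# reduce R there, then ship the (hopefully small) reduced R to S's site for
# the full join.

def semi_join_strategy(R, S, rkey, skey):
    shipped_keys = {s[skey] for s in S}           # pi(S): cheap to ship
    reduced_R = [r for r in R if r[rkey] in shipped_keys]   # R semijoin S
    # reduced_R is what gets shipped back to S's site for the real join:
    return [(r, s) for r in reduced_R for s in S if r[rkey] == s[skey]]

R = [(1, "big"), (2, "row"), (3, "set")]   # lives at site 1
S = [(2, "x")]                             # lives at site 2
semi_join_strategy(R, S, rkey=0, skey=0)   # -> [((2, "row"), (2, "x"))]
```

The win depends on the reduction factor: if most of R survives the semi-join, the extra round trip was wasted, which is the optimizer's cost-estimation burden noted above.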
Bloom Joins
[Babb, TODS 79]
A constant optimization on semi-joins
  Idea: in (R ⋉ π(S)) ⋈ S, the final join makes the semi-join’s precision redundant
  Semi-join can safely return “false hits” from R
  Rather than shipping π(S), ship a superset
    A particular kind of lossy set compression allowed
Bloom Filter (B. Bloom, 1970)




Hash each value in a set via k independent hash
functions onto an array of n bits
Check membership correspondingly
By tuning k and n, can control false hit rate
Rediscovered recently in web lit, some new
wrinkles (Mitzenmacher’s compressed B.F.’s,
Rhea’s attenuated B.F.’s)
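A tiny Bloom filter in this spirit, using salted SHA-256 to simulate k independent hash functions (a common trick; parameter choices here are arbitrary):

```python
# Minimal Bloom filter sketch: k hash functions over an n-bit array.
# Membership tests may yield false positives but never false negatives.
import hashlib

class BloomFilter:
    def __init__(self, n_bits=256, k=3):
        self.n, self.k = n_bits, k
        self.bits = [False] * n_bits

    def _positions(self, item):
        for i in range(self.k):            # k "independent" hashes via salting
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.n

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True means "maybe present" (tune n and k for the false-hit rate);
        # False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for key in ["alpha", "beta"]:
    bf.add(key)
bf.might_contain("alpha")    # True: no false negatives, ever
bf.might_contain("zzz")      # almost certainly False at this low load
```

In a Bloom join, the n-bit array is what gets shipped in place of π(S); false hits just mean a few extra R tuples travel to the join site.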
Sideways Information Passing
These ideas generalize more broadly
  E.g. “magic sets” rewriting in datalog & SQL
  Tricky to do optimally in those settings, but wins can be very big
(Diagram: a costly Set Generator feeds its set o’ stuff to a COMBINER; passing information sideways from the combiner back to the generator makes the generator less costly.)
Parallelism 101 [See DeWitt & Gray, CACM 92]
Pipelined vs. Partitioned

Pipelined typically inter-operator


Nominal benefits in a dataflow
Partition typically intra-operator

E.g. hash join or sort using k nodes
Speedup & Scaleup

Speedup: x=old_time/new_time


Ideal: linear
Scaleup: small_sys_elapsed_small_problem / big_sys_elapsed_big_problem



Ideal: 1
Transaction scaleup: N times as many TPC-C’s for N machines
Batch scaleup: N times as big a DB for a query on N machines
Impediments to Good Parallelism
Startup overheads

Amortized for big queries
Interference

usually the result of unpredictable
communication delays (comm cost, empty
pipelines)
Skew
Of these, skew is the real issue in DBs


“Embarrassingly parallel”: i.e. it works
Data Layout
Horizontal Partitioning
For each table, assign rows to machines by
some key
 Or assign arbitrarily (round-robin)

Vertical Partitioning
Sort table, and slice off columns
 Usually not a parallelism trick


But nice for processing queries on read-mostly
data (projection is free!)
Intra-Operator Parallelism
E.g. for Hash Join
Every site with a horizontal partition of
either R or S fires off a scan thread
 Every storage site reroutes its data among
join nodes based on hash of the join
column(s)
 Upon receipt, each site does local hash
join

Recall Exchange!
Skew Handling
Skew happens

Even when hashing? Yep.
Can pre-sample and/or pre-summarize data
to partition better
Solving skew on the fly is harder


Need to migrate accumulated dataflow state
Flux: Fault-tolerant, Load-balancing eXchange
In Current Architectures
All DBMSs can run on shared memory, many
on shared-nothing

The high end belongs to clusters
The biggest web-search engines run on
clusters (Google, Inktomi)


And use pretty textbook DB stuff for Boolean
search
Fun tradeoffs between answer quality and
availability/management here (the Inktomi story)
Precomputations
Views and Materialization
A view is a logical table


A query with a name
In general, not updatable
If to be used often, could be materialized

Pre-compute and/or cache result
Could even choose to do this for common
query sub-expressions

Needn’t require a DBA to say “this is a view”
Challenges in Materialized Views
Three main issues:
Given a workload, which views should be
materialized
 Given a query, how can mat-views be
incorporated into query optimizer
 As base tables are updated, how can views
be incrementally maintained?

See readings book, Gupta & Mumick
Precomputation in IR
Often want to save results of common
queries

E.g. no point re-running “Britney Spears” or
“Harry Potter” as Boolean queries
Can also use as subquery results

E.g. the query “Harry Potter Loves Britney
Spears” can use the “Harry Potter” and “Britney
Spears” results
Constrained version of mat-views


No surprise -- constrained relational workload
And consistency of matview with raw tables is
not critical, so maintenance not such an issue.
Precomputed Aggregates
Aggregation queries work on numerical sets of data

Math tricks apply here


Theme: replace the raw data with small statistical
summaries, get approximate results



Some trivial, some fancy
Histograms, wavelets, samples, dependency-based models,
random projections, etc.
Heavily used in query optimizers for selectivity estimation
(a COUNT aggregate)
Spate of recent work on approximate query processing for
AVG, SUM, etc.
[Garofalakis, Gehrke, Rastogi tutorial, SIGMOD ‘02]
A Taste of Next Week
Query Dataflows Meet NWs
Some more presentation


Indirection in space and time
Thematic similarities, differences in NW/QP


Adaptive QP In Telegraph


Eddies, SteMs, Flux
A taste of QP in Sensor Networks


Revisit a NW “classic” through a DB lens
TinyDB (TAG), Directed Diffusion
A taste of QP in p2p

PIER project at Berkeley: parallel QP over DHTs
Presentations from MITers?
Open Discussion
Contact
jmh@cs.berkeley.edu
http://www.cs.berkeley.edu/~jmh