Parallel Databases

March 27, 2012

Parallel Databases
􀂄 Introduction
􀂄 I/O Parallelism
􀂄 Interquery Parallelism
􀂄 Intraquery Parallelism
􀂄 Intraoperation Parallelism
􀂄 Interoperation Parallelism
􀂄 Design of Parallel Systems

Introduction
􀂄 Parallel machines are becoming quite common and affordable
􀃌 Prices of microprocessors, memory and disks have dropped sharply
􀂄 Databases are growing increasingly large
􀃌 large volumes of transaction data are collected and stored for later
analysis.
􀃌 multimedia objects like images are increasingly stored in databases
􀂄 Large-scale parallel database systems increasingly used for:
􀃌 storing large volumes of data
􀃌 processing time-consuming decision-support queries
􀃌 providing high throughput for transaction processing

Parallelism in Databases
􀂄 Data can be partitioned across multiple disks for parallel I/O.
􀂄 Individual relational operations (e.g., sort, join, aggregation) can
be executed in parallel
􀃌 data can be partitioned and each processor can work independently
on its own partition.
􀂄 Queries are expressed in high level language (SQL, translated to
relational algebra)
􀃌 makes parallelization easier.
􀂄 Different queries can be run in parallel with each other.
Concurrency control takes care of conflicts.
􀂄 Thus, databases naturally lend themselves to parallelism.

I/O Parallelism
􀂄 Reduce the time required to retrieve relations from disk by partitioning
􀂄 the relations on multiple disks.
􀂄 Horizontal partitioning – tuples of a relation are divided among many
disks such that each tuple resides on one disk.
􀂄 Partitioning techniques (number of disks = n):
Round-robin:
Send the ith tuple inserted in the relation to disk i mod n.
Hash partitioning:
􀃌 Choose one or more attributes as the partitioning attributes.
􀃌 Choose hash function h with range 0…n - 1
􀃌 Let i denote result of hash function h applied tothe partitioning attribute
value of a tuple. Send tuple to disk i.

I/O Parallelism (Cont.)
􀂄 Partitioning techniques (cont.):
􀂄 Range partitioning:
􀃌 Choose an attribute as the partitioning attribute.
􀃌 A partitioning vector [vo, v1, ..., vn-2] is chosen.
􀃌 Let v be the partitioning attribute value of a tuple. Tuples such that vi
≤ vi+1 go to disk I + 1. Tuples with v < v0 go to disk 0 and tuples with
v ≥ vn-2 go to disk n-1.
E.g., with a partitioning vector [5,11], a tuple with partitioning attribute
value of 2 will go to disk 0, a tuple with value 8 will go to disk 1,
while a tuple with value 20 will go to disk2.

Comparison of Partitioning Techniques
􀂄 Evaluate how well partitioning techniques support the following
types of data access:
1.Scanning the entire relation.
2.Locating a tuple associatively – point queries.
􀃌 E.g., r.A = 25.
3.Locating all tuples such that the value of a given attribute lies
within a specified range – range queries.
􀃌 E.g., 10 ≤ r.A < 25.

Comparison of Partitioning Techniques (Cont.)
Round robin:
􀂄 Advantages
􀃌 Best suited for sequential scan of entire relation on each query.
􀃌 All disks have almost an equal number of tuples; retrieval work is
thus well balanced between disks.
􀂄 Range queries are difficult to process
􀃌 No clustering -- tuples are scattered across all disks

Comparison of Partitioning Techniques(Cont.)
Hash partitioning:
􀂄 Good for sequential access
􀃌 Assuming hash function is good, and partitioning attributes form a
key, tuples will be equally distributed between disks
􀃌 Retrieval work is then well balanced between disks.
􀂄 Good for point queries on partitioning attribute
􀃌 Can lookup single disk, leaving others available for answering other
queries.
􀃌 Index on partitioning attribute can be local to disk, making lookup
and update more efficient
􀂄 No clustering, so difficult to answer range queries

Comparison of Partitioning Techniques (Cont.)
Range partitioning:
􀂄 Provides data clustering by partitioning attribute value.
􀂄 Good for sequential access
􀂄 Good for point queries on partitioning attribute: only one disk
needs to be accessed.
􀂄 For range queries on partitioning attribute, one to a few disks
may need to be accessed
− Remaining disks are available for other queries.
− Good if result tuples are from one to a few blocks.
− If many blocks are to be fetched, they are still fetched from one
to a few disks, and potential parallelism in disk access is wasted
􀃌 Example of execution skew.

Partitioning a Relation across Disks
􀂄 If a relation contains only a few tuples which will fit into a single
disk block, then assign the relation to a single disk.
􀂄 Large relations are preferably partitioned across all the available
disks.
􀂄 If a relation consists of m disk blocks and there are n disks
available in the system, then the relation should be allocated
min(m,n) disks.

Search This Blog

world at a GlanCE

Parallel Databases

Comments

Post a Comment

Popular posts from this blog

Handling of Skew

Fragment-and-Replicate Join

USER INTERFACE DESIGN FOR ANNA UNIVERSITY SYLLABUS