Handling of Skew


Handling of Skew
􀂄 The distribution of tuples to disks may be skewed — that is,
some disks have many tuples, while others may have fewer
tuples.
􀂄 Types of skew:
􀃌 Attribute-value skew.
􀂾 Some values appear in the partitioning attributes of many tuples;
all the tuples with the same value for the partitioning attribute
end up in the same partition.
􀂾 Can occur with range-partitioning and hash-partitioning.
􀃌 Partition skew.
􀂾 With range-partitioning, badly chosen partition vector may assign
too many tuples to some partitions and too few to others.
􀂾 Less likely with hash-partitioning if a good hash-function is
chosen.

Handling Skew in Range-Partitioning
􀂄 To create a balanced partitioning vector (assuming partitioning
attribute forms a key of the relation):
􀃌 Sort the relation on the partitioning attribute.
􀃌 Construct the partition vector by scanning the relation in sorted order
as follows.
􀂾 After every 1/nth of the relation has been read, the value of the
partitioning attribute of the next tuple is added to the partition
vector.
􀃌 n denotes the number of partitions to be constructed.
􀃌 Duplicate entries or imbalances can result if duplicates are present in
partitioning attributes.
􀂄 Alternative technique based on histograms used in practice

Handling Skew using Histograms
􀂄 Balanced partitioning vector can be constructed from histogram in a
relatively straightforward fashion
􀂄 Assume uniform distribution within each range of the histogram
􀂄 Histogram can be constructed by scanning relation, or sampling
(blocks containing) tuples of the relation

Handling Skew Using Virtual Processor
Partitioning
􀂄 Skew in range partitioning can be handled elegantly using virtual
processor partitioning:
􀃌 create a large number of partitions (say 10 to 20 times the number
of processors)
􀃌 Assign virtual processors to partitions either in round-robin fashion
or based on estimated cost of processing each virtual partition
􀂄 Basic idea:
􀃌 If any normal partition would have been skewed, it is very likely the
skew is spread over a number of virtual partitions
􀃌 Skewed virtual partitions get spread across a number of processors,
so work gets distributed evenly!



Comments

Popular posts from this blog

Fragment-and-Replicate Join

USER INTERFACE DESIGN FOR ANNA UNIVERSITY SYLLABUS