Handling of Skew
Handling of Skew
The distribution of tuples to disks may be skewed — that is,
some disks have many tuples, while others may have fewer
tuples.
Types of skew:
Attribute-value skew.
Some values appear in the partitioning attributes of many tuples;
all the tuples with the same value for the partitioning attribute
end up in the same partition.
Can occur with range-partitioning and hash-partitioning.
Partition skew.
With range-partitioning, badly chosen partition vector may assign
too many tuples to some partitions and too few to others.
Less likely with hash-partitioning if a good hash-function is
chosen.
Handling Skew in Range-Partitioning
To create a balanced partitioning vector (assuming partitioning
attribute forms a key of the relation):
Sort the relation on the partitioning attribute.
Construct the partition vector by scanning the relation in sorted order
as follows.
After every 1/nth of the relation has been read, the value of the
partitioning attribute of the next tuple is added to the partition
vector.
n denotes the number of partitions to be constructed.
Duplicate entries or imbalances can result if duplicates are present in
partitioning attributes.
Alternative technique based on histograms used in practice
Handling Skew using Histograms
Balanced partitioning vector can be constructed from histogram in a
relatively straightforward fashion
Assume uniform distribution within each range of the histogram
Histogram can be constructed by scanning relation, or sampling
(blocks containing) tuples of the relation
Handling Skew Using Virtual Processor
Partitioning
Skew in range partitioning can be handled elegantly using virtual
processor partitioning:
create a large number of partitions (say 10 to 20 times the number
of processors)
Assign virtual processors to partitions either in round-robin fashion
or based on estimated cost of processing each virtual partition
Basic idea:
If any normal partition would have been skewed, it is very likely the
skew is spread over a number of virtual partitions
Skewed virtual partitions get spread across a number of processors,
so work gets distributed evenly!
Comments
Post a Comment