Spark - Sort

1 - About

See also: takeOrdered

3 - Example

3.1 - sortByKey

sortByKey() return a new dataset (K, V) pairs sorted by keys in ascending order

rdd2 = sc.parallelize([(1,'a'), (2,'c'), (1,'b')])
rdd2.sortByKey() 
RDD: [(1,'a'), (2,'c'), (1,'b')] → [(1,'a'), (1,'b'), (2,'c')] 

Syntax:

sortByKey(ascending=True, numPartitions=None, keyfunc=<function <lambda> at 0x7f7a2d8a4488>)

3.1.1 - If Key not unique

3.1.1.1 - Different orders

If the key is not unique, this function can give different order results.

tmp1 = [(1, u'alpha'), (2, u'alpha'), (2, u'beta'), (3, u'alpha'), (1, u'epsilon'), (1, u'delta')]
tmp2 = [(1, u'delta'), (2, u'alpha'), (2, u'beta'), (3, u'alpha'), (1, u'epsilon'), (1, u'alpha')]
 
oneRDD = sc.parallelize(tmp1)
twoRDD = sc.parallelize(tmp2)
oneSorted = oneRDD.sortByKey(True).collect()
twoSorted = twoRDD.sortByKey(True).collect()
print oneSorted
print twoSorted
assert set(oneSorted) == set(twoSorted)     # Note that both lists have the same elements
assert twoSorted[0][0] < twoSorted.pop()[0] # Check that it is sorted by the keys
assert oneSorted[0:2] != twoSorted[0:2]     # Note that the subset consisting of the first two elements does not match
[(1, u'alpha'), (1, u'epsilon'), (1, u'delta'), (2, u'alpha'), (2, u'beta'), (3, u'alpha')]
[(1, u'delta'), (1, u'epsilon'), (1, u'alpha'), (2, u'alpha'), (2, u'beta'), (3, u'alpha')]
3.1.1.2 - Solution

A better technique is to sort the RDD by both the key and value, which we can do by combining the key and value into a single string and then sorting on that string.

def sortFunction(tuple):
    """ Construct the sort string (does not perform actual sorting)
    Args:
        tuple: (rating, MovieName)
    Returns:
        sortString: the value to sort with, 'rating MovieName'
    """
    key = unicode('%.3f' % tuple[0])
    value = tuple[1]
    return (key + ' ' + value)
 
 
print oneRDD.sortBy(sortFunction, True).collect()
print twoRDD.sortBy(sortFunction, True).collect()
[(1, u'alpha'), (1, u'delta'), (1, u'epsilon'), (2, u'alpha'), (2, u'beta'), (3, u'alpha')]
[(1, u'alpha'), (1, u'delta'), (1, u'epsilon'), (2, u'alpha'), (2, u'beta'), (3, u'alpha')]

3.2 - sortBy

sortBy sorts the RDD by the given keyfunc

sortBy(keyfunc, ascending=True, numPartitions=None)
db/spark/sort.txt · Last modified: 2017/09/06 20:15 by gerardnico