Spark - Broadcast variables

1 - About

Broadcast variables are an efficient way to send a value to the workers once, instead of having Spark ship it automatically with every task closure that references it.

  • They let us efficiently send a large, read-only value to all of the workers.
  • The value is cached at each worker for use in one or more Spark operations.
  • It is like sending a large, read-only lookup table to all the nodes.
  • The value is shipped to each worker only once instead of with each task.

Broadcast variables allow us to keep a read-only variable cached at each worker.

We can send it to each worker only once, instead of having to send it with every task that runs on that worker.

Spark distributes broadcast variables using an efficient broadcast algorithm (a BitTorrent-like protocol by default) to reduce communication cost.

The most common usage is to give every worker a copy of a large dataset or lookup table.

3 - API

broadcastVar = sc.broadcast([1, 2, 3])
broadcastVar.value  # returns [1, 2, 3]

4 - Example

Look up the locations of the call signs in the RDD contactCounts. We load a table mapping call sign prefixes to country codes to support this lookup.
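
The helper functions loadCallSignTable and lookupCountry used in the examples below are not defined on this page. A minimal pure-Python sketch, assuming a simple prefix-to-country dictionary (the table contents and matching strategy here are illustrative, not the real ones):

```python
# Hypothetical stand-ins for the helpers used in the examples below.
def loadCallSignTable():
    # Maps call sign prefixes to countries; a real table is much larger.
    return {"W": "USA", "VE": "Canada", "JA": "Japan"}

def lookupCountry(sign, prefixes):
    # Longest-prefix match of the call sign against the table keys.
    for n in range(len(sign), 0, -1):
        if sign[:n] in prefixes:
            return prefixes[sign[:n]]
    return None
```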

4.1 - Without broadcasting

signPrefixes = loadCallSignTable()
 
def processSignCount(sign_count): 
	# signPrefixes is captured in the closure, so Spark ships it with every task
	country = lookupCountry(sign_count[0], signPrefixes) 
	count = sign_count[1] 
	return (country, count) 
 
countryContactCounts = (contactCounts 
	.map(processSignCount)
	.reduceByKey(lambda x, y: x + y)
	)

4.2 - With broadcast

signPrefixes = sc.broadcast(loadCallSignTable())
 
def processSignCount(sign_count): 
	# signPrefixes is now a Broadcast object; .value retrieves the table,
	# which is shipped to each worker only once
	country = lookupCountry(sign_count[0], signPrefixes.value) 
	count = sign_count[1] 
	return (country, count) 
 
countryContactCounts = (contactCounts 
	.map(processSignCount)
	.reduceByKey(lambda x, y: x + y)
	)
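
The example above needs a Spark cluster and helpers that are not shown here, but the data flow of the map / reduceByKey pipeline can be checked locally with a small Spark-free simulation (all names, table contents, and input data below are hypothetical):

```python
from collections import defaultdict

# Hypothetical stand-in for the call-sign table; a real one is much larger.
prefixes = {"W": "USA", "VE": "Canada", "JA": "Japan"}

def lookup_country(sign, table):
    # Longest-prefix match of the call sign against the table keys.
    for n in range(len(sign), 0, -1):
        if sign[:n] in table:
            return table[sign[:n]]
    return None

# Stand-in for the contactCounts RDD: (call sign, count) pairs.
contact_counts = [("W1AW", 3), ("VE3ABC", 2), ("JA1XYZ", 5), ("W6YX", 1)]

# map step: (call sign, count) -> (country, count)
mapped = [(lookup_country(sign, prefixes), count)
          for sign, count in contact_counts]

# reduceByKey step: sum the counts per country
totals = defaultdict(int)
for country, count in mapped:
    totals[country] += count
# dict(totals) -> {'USA': 4, 'Canada': 2, 'Japan': 5}
```

In real Spark, the only change between 4.1 and 4.2 is where `prefixes` lives: captured in a closure (re-sent per task) versus wrapped in a broadcast variable (sent once per worker); the map and reduce logic is identical.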

5 - Documentation / Reference

db/spark/broadcast.txt · Last modified: 2017/09/06 20:15 by gerardnico