Spark - Broadcast variables

Spark Pipeline

About

Broadcast variables are an efficient way of sending data once that would otherwise be sent multiple times automatically in closures.

  • Enable to efficiently send large read-only values to all of the workers.
  • Saved at workers for use in one or more Spark operations
  • It's like sending a large, read-only lookup table to all the nodes
  • Ship to each worker only once instead of with each task

Broadcast variables allow us to keep a read-only variable cached at a worker.

We can send it to the worker only once instead of having to send it with each task that we perform at that worker.

Now usually, broadcast variables are distributed using very efficient broadcast algorithms.

The most common usage is to give every worker a large data set or table.

API

broadcastVar = sc.broadcast([1, 2, 3])
broadcastVar.value [1, 2, 3]

Example

Lookup the locations of the call signs on the RDD contactCounts. We load a list of call sign prefixes to country code to support this lookup

Without broadcasting

signPrefixes = loadCallSignTable()

def processSignCount(sign_count, signPrefixes): 
	country = lookupCountry(sign_count[0], signPrefixes) 
	count = sign_count[1] 
	return (country, count) 

countryContactCounts = (contactCounts 
	.map(processSignCount)
	.reduceByKey((lambda x, y: x+ y))
	)

With broadcast

signPrefixes = sc.broadcast(loadCallSignTable())

def processSignCount(sign_count, signPrefixes.value): 
	country = lookupCountry(sign_count[0], signPrefixes) 
	count = sign_count[1] 
	return (country, count) 

countryContactCounts = (contactCounts 
	.map(processSignCount)
	.reduceByKey((lambda x, y: x+ y))
	)

Documentation / Reference





Discover More
Card Puncher Data Processing
Spark

Map reduce and streaming framework in memory. See: . The library entry point of which is also a connection object is called a session (known also as context). Component: DAG scheduler, ...
Spark Pipeline
Spark - Variable

pySpark provides shared variables in two different types. Broadcast variables are an efficient way of sending data once that would otherwise be sent multiple times automatically in closures. Accumulators...



Share this page:
Follow us:
Task Runner