Spark RDD - (Creation|Construction|Initialization)
About
RDD type
Articles Related
Example
List
One
data = [1, 2, 3, 4, 5]
rDD = sc.parallelize(data, 4)
No computation occurs when sc.parallelize() is called. Spark only records how to create the RDD with four partitions; like all transformations, it is lazy and nothing runs until an action is invoked.
>>> rDD
ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:229
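How the five elements end up in four partitions can be pictured with a plain-Python sketch of contiguous range slicing. The slice_data helper below is illustrative only (it is not part of the PySpark API), but it mirrors the kind of slicing parallelize applies to a local collection:

```python
def slice_data(data, num_partitions):
    """Split a list into num_partitions contiguous slices,
    roughly the way parallelize distributes a local collection."""
    n = len(data)
    return [data[i * n // num_partitions:(i + 1) * n // num_partitions]
            for i in range(num_partitions)]

partitions = slice_data([1, 2, 3, 4, 5], 4)
print(partitions)  # [[1], [2], [3], [4, 5]]
```

Note that partitions need not be equal in size: here the last partition holds two elements.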
Several
sc.parallelize([[1,2],[3,4]]).collect()
[[1, 2], [3, 4]]
Key Value
rdd = sc.parallelize([(1, 2), (3, 4)])
RDD: [(1, 2), (3, 4)]
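A pair RDD like this one unlocks key-based operations such as reduceByKey. As a sketch of what such an operation does conceptually, here is a pure-Python stand-in (the reduce_by_key function is a hypothetical helper for illustration, not a Spark API):

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, func):
    """Group (key, value) tuples by key and fold each group's values,
    mimicking what reduceByKey does across an RDD's partitions."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted((k, reduce(func, values)) for k, values in grouped.items())

print(reduce_by_key([(1, 2), (3, 4), (1, 10)], lambda a, b: a + b))
# [(1, 12), (3, 4)]
```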
File
distFile = sc.textFile("README.md", 4)
where:
- the first argument is a comma-separated list of paths. Example: /my/dir1,/my/paths/part-00[0-5]*,/another/dir,/a/specific/file
The resulting RDD is then a collection of strings, one element per line of the file.
A file can come from:
- the local file system (a plain text file),
- HDFS,
- Amazon S3,
- Apache HBase,
- Hypertable,
- SequenceFiles,
- even a whole directory or a wildcard pattern.
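The "one string per line" semantics of textFile can be sketched in plain Python. The sample file below is a hypothetical stand-in for README.md, and read_lines is an illustrative helper, not a Spark API:

```python
import os
import tempfile

def read_lines(path):
    """Return a file's content as a list of strings, one per line,
    mirroring how textFile exposes a file as an RDD of lines."""
    with open(path) as f:
        return [line.rstrip("\n") for line in f]

# Hypothetical sample file standing in for README.md
with tempfile.NamedTemporaryFile("w", suffix=".md", delete=False) as f:
    f.write("# Title\nfirst line\nsecond line\n")
    path = f.name

lines = read_lines(path)
print(lines)  # ['# Title', 'first line', 'second line']
os.remove(path)
```

Unlike this sketch, distFile.collect() would pull the lines from all four partitions back to the driver, and the file is not actually read until such an action runs.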