Sparklyr - Table


About

Once a connection is made, the table (or data frame) is manipulated with R - Dplyr (Data Frame Operations).

Management

Initialize

Local (Load)

  • Load the iris data set into Spark. The resulting Spark table is temporary: it exists only for the duration of the current connection.
iris_tbl <- dplyr::copy_to(sc, iris)
flights_tbl <- dplyr::copy_to(sc, nycflights13::flights, "flights")

Remote

flights_tbl <- tbl(sc, from="flights")

List

  • You can see the tables in the Spark view of RStudio


  • List the tables
dplyr::src_tbls(sc)
[1] "iris"

sample_tbl <- dplyr::tbl(sc, from="hivesampletable")
head(sample_tbl)
# Source:   lazy query [?? x 11]
# Database: spark_connection
  clientid querytime market deviceplatform devicemake devicemodel state  country querydwelltime sessionid
  <chr>    <chr>     <chr>  <chr>          <chr>      <chr>       <chr>  <chr>            <dbl>     <dbl>
1 8        18:54:20  en-US  Android        Samsung    SCH-i500    Calif~ United~         13.9           0
2 23       19:19:44  en-US  Android        HTC        Incredible  Penns~ United~        NaN             0
3 23       19:19:46  en-US  Android        HTC        Incredible  Penns~ United~          1.48          0
4 23       19:19:47  en-US  Android        HTC        Incredible  Penns~ United~          0.246         0
5 28       01:37:50  en-US  Android        Motorola   Droid X     Color~ United~         20.3           1
6 28       00:53:31  en-US  Android        Motorola   Droid X     Color~ United~         16.3           0
# ... with 1 more variable: sessionpagevieworder <dbl>

Query

sample_tbl %>%
  group_by(market) %>%
  summarise(count = n(), queryDwellTime = mean(querydwelltime)) %>%
  filter(count > 20, queryDwellTime > 30) %>%
  collect()
# A tibble: 11 x 3
   market count queryDwellTime
   <chr>  <dbl>          <dbl>
 1 es-ES     30          110. 
 2 en-CA     71           60.1
 3 en-IN     37           80.7
 4 it-IT     33           80.8
 5 fr-FR     55           45.9
 6 zh-CN    101           45.8
 7 en-GB   1817           82.5
 8 de-DE     52           63.9
 9 da-DK     31           51.8
10 en-AU     53           76.3
11 en-US  57303        27791. 
Warning message:
Missing values are always removed in SQL.
Use `AVG(x, na.rm = TRUE)` to silence this warning 
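The warning above appears because SQL aggregates always drop missing values, whereas R's mean() keeps them unless told otherwise. Passing na.rm = TRUE makes the behavior explicit. The following is a minimal sketch on a small local tibble; the sample_df rows are hypothetical stand-ins for hivesampletable, and the same summarise() call should carry over to sample_tbl.

```r
library(dplyr)

# Hypothetical local rows standing in for hivesampletable (assumption)
sample_df <- tibble(
  market = c("en-US", "en-US", "en-GB", "en-GB"),
  querydwelltime = c(13.9, NaN, 1.48, 20.3)
)

# na.rm = TRUE drops NA/NaN values before averaging,
# which matches what the SQL AVG() aggregate does on the Spark side
dwell_by_market <- sample_df %>%
  group_by(market) %>%
  summarise(
    count = n(),
    queryDwellTime = mean(querydwelltime, na.rm = TRUE)
  )

print(dwell_by_market)
```

Without na.rm = TRUE, the local mean for a group containing NaN would itself be NaN, so the explicit argument also keeps local and remote results comparable.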

Support

Livy - Message (54876255 bytes) exceeds maximum allowed size (52428800 bytes)

When trying to load a big data frame such as the flights data, you may get the following error:

Error in livy_validate_http_response("Failed to invoke livy statement",  : 
  Failed to invoke livy statement (Server error: (500) Internal Server Error): "java.util.concurrent.ExecutionException: io.netty.handler.codec.EncoderException: java.lang.IllegalArgumentException: Message (54876255 bytes) exceeds maximum allowed size (52428800 bytes)."

This maximum size is specified by the livy.rsc.rpc.max.size parameter. See Configuring the rpc.max.size setting.

It must be set at both the system and the session scope. Unfortunately, it seems that you can't set it with Sparklyr: the livy_config function has no conf parameter to set it at the session level.
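Before copying a data frame over Livy, a rough size check can indicate whether it is likely to hit this limit. The sketch below is only an approximation: fits_livy_limit is a hypothetical helper, and the in-memory size reported by object.size() is not the same as the size of the serialized message that Livy actually receives.

```r
# Limit taken from the error message above (50 MiB)
livy_max_bytes <- 52428800

# Hypothetical helper (assumption): compares the in-memory size of a
# data frame with the Livy limit. This is only a rough proxy for the
# size of the RPC message.
fits_livy_limit <- function(df, max_bytes = livy_max_bytes) {
  as.numeric(object.size(df)) < max_bytes
}

small_ok <- fits_livy_limit(iris)

# A single numeric column of 7 million rows takes about 56 MB in memory
big_ok <- fits_livy_limit(data.frame(x = numeric(7e6)))

print(c(small = small_ok, big = big_ok))
```

A data frame that fails this check is a candidate for loading by another route, for example reading it from storage on the cluster side rather than pushing it through copy_to.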
