# R - Data Table

### Table of Contents

## 1 - About

A data table is an enhanced data.frame.

data.tables (and data.frames) are internally lists as well, but with all its columns of equal length and with a class attribute.

## 2 - Articles Related

Advertising

## 3 - Create

`data.table`

function

DT = data.table( ID = c("b","b","b","a","a","c"), a = 1:6, b = 7:12, c = 13:18)

`fread()`

## 4 - Syntax

- Take DT, filter rows using i, then calculate j, grouped by by.

DT[i, j, (by|keyby)] DT[ i, j, by ] # + extra arguments | | | | | -------> grouped by what? | -------> what to do? ---> on which rows?

R | SQL |
---|---|

i | where (subset, filtering) and order with the `order()` function |

j | select columns * the data.table way: `DT[, .(colA, colB)]` * the data.frame way: `DT[, c(“colA”, “colB”), with = FALSE]` |

j | compute `DT[, .(sumA = sum(colA), meanB = mean(colB))]` |

by | group by |

keyby | group by ordered |

Advertising

### 4.1 - Filter (Subset)

- Columns can be referred to as if they are variables. In data.frames a comma at the end is necessary but not here

ans <- flights[origin == "JFK" & month == 6L]

- Indexing. Get the first two rows from flights.

ans <- flights[1:2]

### 4.2 - Order

#### 4.2.1 - Without group by

Sort flights first by column origin in ascending order, and then by dest in descending order

ans <- flights[order(origin, -dest)]

#### 4.2.2 - With group by

`data.table`

retains the original order of groups (by design) but we can force it with the`keyby`

keyword.

flights[carrier == "AA", .(mean(arr_delay), mean(dep_delay)), keyby = .(origin, dest, month)]

`keyby()`

is applied after performing the operation, i.e., on the computed result and sets a key after ordering by setting an attribute called `sorted`

.

### 4.3 - Select

- Select one column

# return it as a vector. ans <- flights[, arr_delay] # return as a data.table ans <- flights[, list(arr_delay)] ans <- flights[, .(arr_delay)] # The point is an alias to list.

- Select two columns

ans <- flights[, .(arr_delay, dep_delay)]

- Specify alias (ie rename them)

ans <- flights[, .(delay_arr = arr_delay, delay_dep = dep_delay)]

Advertising

### 4.4 - Compute

- How many trips have had total delay < 0?. Computation gets by default the name
`Vn`

(ie V1, V2,…)

ans <- flights[, sum((arr_delay + dep_delay) < 0)]

- Calculate the average arrival and departure delay for all flights with “JFK” as the origin airport in the month of June.

ans <- flights[origin == "JFK" & month == 6L, .(m_arr = mean(arr_delay), m_dep = mean(dep_delay))]

- How many trips have been made in 2014 from “JFK” airport in the month of June?

ans <- flights[origin == "JFK" & month == 6L, length(dest)] # length = count (can have any argument) ans <- flights[origin == "JFK" & month == 6L, .N] #.N is an alias for length

### 4.5 - Grouping

- the number of trips corresponding to each origin airport?

ans <- flights[, .(.N), by = .(origin)]

* the number of trips for each origin airport for carrier code “AA”

ans <- flights[carrier == "AA", .N, by = origin]

- the total number of trips for each origin, dest pair for carrier code “AA”

ans <- flights[carrier == "AA", .N, by = .(origin,dest)]

- the average arrival and departure delay for each orig,dest pair for each month for carrier code “AA”

ans <- flights[carrier == "AA", .(mean(arr_delay), mean(dep_delay)), by = .(origin, dest, month)]

- Expression in the group by statement (output will be two boolean columns)

ans <- flights[, .N, .(dep_delay>0, arr_delay>0)]

### 4.6 - Subset of Data (Partition)

`.SD`

for Subset of Data:

- is a
`data.table`

that holds the data for the current group defined using by. - contains all the columns except the grouping columns by default.
- preserves the original order

Example:

- print

DT[, print(.SD), by = col]

- compute on (multiple) columns with lapply.

DT[, lapply(.SD, mean), by = ID]

- specify the SD columns (By default, it contains all the columns other than the grouping variables)

flights[carrier == "AA", ## Only on trips with carrier "AA" lapply(.SD, mean), ## compute the mean by = .(origin, dest, month), ## for every 'origin,dest,month' .SDcols = c("arr_delay", "dep_delay")] ## for just those specified in .SDcols

- the first two rows for each month?

ans <- flights[, head(.SD, 2), by = month]