# R - Data Table

A data table is an enhanced data.frame.

data.tables (and data.frames) are internally lists as well, but with all its columns of equal length and with a class attribute.

## 3 - Create

• `data.table` function
```DT = data.table(
ID = c("b","b","b","a","a","c"),
a = 1:6,
b = 7:12,
c = 13:18)```
• `fread()`

## 4 - Syntax

• Take DT, filter rows using i, then calculate j, grouped by by.
```DT[i, j, (by|keyby)]

DT[ i,  j,  by ] # + extra arguments
|   |   |
|   |    -------> grouped by what?
|    -------> what to do?
---> on which rows?```
R SQL
i where (subset, filtering) and order with the `order()` function
j select columns
* the data.table way: `DT[, .(colA, colB)]`
* the data.frame way: `DT[, c(“colA”, “colB”), with = FALSE]`
j compute `DT[, .(sumA = sum(colA), meanB = mean(colB))]`
by group by
keyby group by ordered

### 4.1 - Filter (Subset)

• Columns can be referred to as if they are variables. In data.frames a comma at the end is necessary but not here
`ans <- flights[origin == "JFK" & month == 6L]`
• Indexing. Get the first two rows from flights.
`ans <- flights[1:2]`

### 4.2 - Order

#### 4.2.1 - Without group by

Sort flights first by column origin in ascending order, and then by dest in descending order

`ans <- flights[order(origin, -dest)]`

#### 4.2.2 - With group by

• `data.table` retains the original order of groups (by design) but we can force it with the `keyby` keyword.
```flights[carrier == "AA",
.(mean(arr_delay), mean(dep_delay)),
keyby = .(origin, dest, month)]```

`keyby()` is applied after performing the operation, i.e., on the computed result and sets a key after ordering by setting an attribute called `sorted`.

### 4.3 - Select

• Select one column
```# return it as a vector.
ans <- flights[, arr_delay]
# return as a data.table
ans <- flights[, list(arr_delay)]
ans <- flights[, .(arr_delay)] # The point is an alias to list.```
• Select two columns
`ans <- flights[, .(arr_delay, dep_delay)]`
• Specify alias (ie rename them)
`ans <- flights[, .(delay_arr = arr_delay, delay_dep = dep_delay)]`

### 4.4 - Compute

• How many trips have had total delay < 0?. Computation gets by default the name `Vn` (ie V1, V2,…)
`ans <- flights[, sum((arr_delay + dep_delay) < 0)]`
• Calculate the average arrival and departure delay for all flights with “JFK” as the origin airport in the month of June.
```ans <- flights[origin == "JFK" & month == 6L,
.(m_arr = mean(arr_delay), m_dep = mean(dep_delay))]```
• How many trips have been made in 2014 from “JFK” airport in the month of June?
```ans <- flights[origin == "JFK" & month == 6L, length(dest)] # length = count (can have any argument)
ans <- flights[origin == "JFK" & month == 6L, .N] #.N is an alias for length```

### 4.5 - Grouping

• the number of trips corresponding to each origin airport?
`ans <- flights[, .(.N), by = .(origin)]`

* the number of trips for each origin airport for carrier code “AA”

`ans <- flights[carrier == "AA", .N, by = origin]`
• the total number of trips for each origin, dest pair for carrier code “AA”
`ans <- flights[carrier == "AA", .N, by = .(origin,dest)]`
• the average arrival and departure delay for each orig,dest pair for each month for carrier code “AA”
```ans <- flights[carrier == "AA",
.(mean(arr_delay), mean(dep_delay)),
by = .(origin, dest, month)]```
• Expression in the group by statement (output will be two boolean columns)
`ans <- flights[, .N, .(dep_delay>0, arr_delay>0)]`

### 4.6 - Subset of Data (Partition)

`.SD` for Subset of Data:

• is a `data.table` that holds the data for the current group defined using by.
• contains all the columns except the grouping columns by default.
• preserves the original order

Example:

• print
`DT[, print(.SD), by = col]`
• compute on (multiple) columns with lapply.
`DT[, lapply(.SD, mean), by = ID]`
• specify the SD columns (By default, it contains all the columns other than the grouping variables)
```flights[carrier == "AA",                       ## Only on trips with carrier "AA"
lapply(.SD, mean),                     ## compute the mean
by = .(origin, dest, month),           ## for every 'origin,dest,month'
.SDcols = c("arr_delay", "dep_delay")] ## for just those specified in .SDcols```
• the first two rows for each month?
`ans <- flights[, head(.SD, 2), by = month]`

### 4.7 - Chaining

• Group by + order by
`ans <- flights[carrier == "AA", .N, by = .(origin, dest)][order(origin, -dest)]`
• The expression can also be chained vertically.
```DT[ ...
][ ...
][ ...
]```