R - read.table

1 - About

The read.table function reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields in the file.

3 - Syntax

read.table(
     file, 
     header = FALSE, 
     sep = "", 
     quote = "\"'",
     dec = ".",
     row.names,
     col.names,
     as.is = !stringsAsFactors,
     na.strings = "NA", 
     colClasses = NA, 
     nrows = -1,
     skip = 0, 
     check.names = TRUE, 
     fill = !blank.lines.skip,
     strip.white = FALSE, 
     blank.lines.skip = TRUE,
     comment.char = "#",
     allowEscapes = FALSE,
     flush = FALSE,
     stringsAsFactors = default.stringsAsFactors(),
     fileEncoding = "",
     encoding = "unknown",
     text
     )

where:

  • file, the name of a file, a URL, or a connection
  • header, a logical indicating whether the file has a header line
  • sep, a string indicating how the columns are separated
  • colClasses, a character vector indicating the class of each column in the dataset
  • nrows, the maximum number of rows to read from the dataset
  • comment.char, a character string indicating the comment character
  • skip, the number of lines to skip from the beginning of the file
  • stringsAsFactors, a logical indicating whether character variables should be converted to factors (the default is FALSE since R 4.0.0)
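As a minimal sketch of how these parameters fit together, the example below parses a small inline sample through the text argument instead of a file. The sample data and its column names are invented for illustration.

```r
# Hypothetical sample data: one comment line, a header line, three records.
sample_text <- "# exported measurements
id;name;score
1;alice;9.5
2;bob;7.25
3;carol;NA"

df <- read.table(
  text = sample_text,        # parse the string instead of reading a file
  header = TRUE,             # the first non-comment line holds the column names
  sep = ";",                 # fields are separated by semicolons
  comment.char = "#",        # lines starting with # are ignored
  stringsAsFactors = FALSE   # keep character columns as character
)

# Show the inferred column types.
str(df)
```

The NA in the last record is read as a missing value because it matches the default na.strings = "NA".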

4 - Performance

By default, read.table will:

  • infer colClasses (the type of variable in each column of the table)
  • check every line for comments, as controlled by comment.char (comment.char = "" disables this check)

Supplying these parameters makes read.table run faster, because R no longer needs to perform these steps itself.
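The following sketch shows one way to supply these parameters explicitly. The demo file, its columns, and the variable names are assumptions made for illustration; the file is written to a temporary location so the example is self-contained.

```r
# Write a hypothetical demo file with three columns and 1000 rows.
path <- tempfile(fileext = ".txt")
write.table(
  data.frame(id = 1:1000, value = rnorm(1000), label = "a"),
  file = path, row.names = FALSE, quote = FALSE
)

# Tell read.table the column classes up front and turn off comment scanning.
fast <- read.table(
  path,
  header = TRUE,
  colClasses = c("integer", "numeric", "character"),
  comment.char = "",
  nrows = 1000
)

sapply(fast, class)
```

On a file this small the difference is negligible; the benefit is expected to show on large files.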

5 - Memory

read.table loads the whole dataset into memory, so the dataset must not be larger than the amount of RAM available.

1,000,000 rows and 10 columns of numeric data = 1,000,000 * 10 * 8 bytes = 80,000,000 bytes, or about 76 MiB (80,000,000 / 2^20).
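The same estimate can be reproduced in R. This is only a rough lower bound: it assumes every column is numeric (8 bytes per value) and ignores the overhead of the data frame itself. The variable names are illustrative.

```r
rows <- 1e6               # number of rows in the dataset
cols <- 10                # number of numeric columns
bytes_per_numeric <- 8    # a double-precision value takes 8 bytes

size_bytes <- rows * cols * bytes_per_numeric
size_mib <- size_bytes / 2^20   # convert bytes to MiB

round(size_mib, 1)   # about 76.3
```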

6 - Options

6.1 - colClasses

colClasses = "numeric"

To figure out the class of each column, you can use this snippet, which infers the classes from the first 100 rows (they may not hold for every row of the file):

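# Read only the first 100 rows and let R infer the column classes
mySubsetDataTable <- read.table("myFile.txt", nrows = 100)
classes <- sapply(mySubsetDataTable, class)
# Read the whole file, reusing the inferred classes
myDataTable <- read.table("myFile.txt", colClasses = classes)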

6.2 - nrows

Use the Linux tool wc (wc -l) to count the number of lines in a file.

Setting nrows, even as a mild overestimate, helps with memory usage.
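Where wc is not available, the line count can also be obtained from within R. The sketch below is one possible approach, not the only one: count_lines is a hypothetical helper (not part of R) that reads the file in chunks through a connection, and the demo file is created only for illustration.

```r
# Hypothetical helper: count the lines of a file without loading it all at once.
count_lines <- function(path, chunk = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  total <- 0L
  repeat {
    n <- length(readLines(con, n = chunk))  # read the next chunk of lines
    if (n == 0L) break                      # stop at the end of the file
    total <- total + n
  }
  total
}

# Demo file: one header line followed by five data rows.
demo_path <- tempfile(fileext = ".txt")
writeLines(c("x y", "1 2", "3 4", "5 6", "7 8", "9 10"), demo_path)

n_lines <- count_lines(demo_path)

# Subtract the header line to get the number of data rows.
demo <- read.table(demo_path, header = TRUE, nrows = n_lines - 1)
```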