What is a data format?
When working with data (e.g. gene expression, annotations, mutations, or analysis results), we need a proper way to store it:
- in memory while working on it
- on disk for later use
There are different formats one can use.
Pandas DataFrame
import pandas as pd

dataset = pd.DataFrame(
    data={
        'string': ('apple', 'banana', 'carrot'),
        'integer': (0, 1, 2),
        'float': (0.0, 1.1, 2.2),
    },
)
The data in this DataFrame is in the tidy data format: each column is a variable and each row is an observation.
NumPy arrays
import numpy as np

# One-dimensional array
np.array([7, 2, 9, 10])

# Two-dimensional array
np.array([
    [5.2, 3.0, 4.5],
    [9.1, 0.1, 0.3],
])

# Three-dimensional array
np.array([
    [[1, 4, 7], [2, 9, 7], [1, 3, 0], [9, 6, 9]],
    [[2, 3, 4], [1, 2, 5], [3, 6, 2], [4, 7, 8]],
])
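NumPy arrays are different from Pandas DataFrames: an array holds elements of a single data type and can have any number of dimensions, whereas a DataFrame is a two-dimensional table whose columns can each have a different data type.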
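The difference can be inspected directly. The following is a minimal sketch that reuses the small example values from above; the variable names are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

# A DataFrame is a two-dimensional table; each column has its own dtype.
dataset = pd.DataFrame({
    'string': ['apple', 'banana', 'carrot'],
    'integer': [0, 1, 2],
    'float': [0.0, 1.1, 2.2],
})
print(dataset.dtypes)

# A NumPy array has one dtype for all elements and any number of dimensions.
data_array = np.zeros((2, 4, 3))
print(data_array.dtype, data_array.ndim, data_array.shape)
```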
Question:
Can we store these datasets in a file in a way that keeps the data format intact?
Answer:
We need a file format that supports our chosen data format to do so.
Pandas and NumPy support and integrate many different file formats. See, for example:
https://pandas.pydata.org/docs/user_guide/io.html
https://numpy.org/doc/stable/reference/routines.io.html
What to look for in a file format?
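For each format, consider the following: its type (text or binary), the packages it needs, its space efficiency, whether it is good for sharing and archival, its speed and ease of use for tidy data and for array data, and its best use cases.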
CSV
- Type: Text format
- Packages needed: numpy, pandas, csv
- Space efficiency: Bad
- Good for sharing/archival: Yes
- Tidy data:
  - Speed: Bad
  - Ease of use: Great
- Array data:
  - Speed: Bad
  - Ease of use: Ok for one/two dimensional data. Bad for anything higher.
- Best use cases: Sharing data. Small data. Data that needs to be human-readable.
CSV is one of the most popular file formats, as it is human-readable and easy to share.
However, it does not preserve data types, and it is not strictly standardized, so files can differ in delimiters and quoting.
Pandas
# index=False leaves the row index out of the file
dataset.to_csv('dataset.csv', index=False)
dataset_csv = pd.read_csv('dataset.csv')
NumPy
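# data_array is assumed to be a one- or two-dimensional NumPy array.
# savetxt separates values with spaces by default, so set the delimiter explicitly.
np.savetxt('data_array.csv', data_array, delimiter=',')
data_array_csv = np.loadtxt('data_array.csv', delimiter=',')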
Python csv module
import csv

# Read the file row by row; each row is a list of strings.
with open('dataset.csv', newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in csvreader:
        ...

# Write rows one at a time; opening with 'w' overwrites the file.
with open('dataset.csv', 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile, delimiter=',', quotechar='"')
    csvwriter.writerow(['A', 'B', 'C'])
    ...
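The point that CSV does not preserve data types can be seen in a round trip. This is a minimal sketch with made-up sample identifiers; it writes to an in-memory text buffer rather than a file so that it leaves nothing on disk.

```python
import io

import pandas as pd

# Hypothetical data: identifiers with leading zeros, stored as strings.
original = pd.DataFrame({
    'sample_id': ['001', '002', '003'],
    'value': [0.5, 1.5, 2.5],
})

# Write the table as CSV text into an in-memory buffer.
buffer = io.StringIO()
original.to_csv(buffer, index=False)

# Without hints, read_csv infers column types from the text alone,
# so the identifiers come back as integers.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored['sample_id'].tolist())

# Stating the type explicitly keeps the identifiers as strings.
buffer.seek(0)
restored_str = pd.read_csv(buffer, dtype={'sample_id': str})
print(restored_str['sample_id'].tolist())
```

The file itself carries no type information, so the reader has to supply it on every load.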
JSON
- Type: Text format
- Packages needed: None (json-module is included with Python).
- Space efficiency: Ok.
- Good for sharing/archival: OK
- Tidy data:
  - Speed: Ok
  - Ease of use: Ok
- Array data:
  - Speed: Ok
  - Ease of use: Ok
- Best use cases: Saving Python objects for debugging.
JSON is often used to represent hierarchical data with multiple layers or multiple connections.
JSON is standardized (ISO standard) and preserves basic data types such as strings, numbers, booleans, and nested lists and objects. However, when you’re working with big data, you rarely want to keep your data in this format.
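A minimal sketch of a JSON round trip with Python's built-in json module follows. The nested record is invented for illustration. Note that Python-specific types are not kept: the tuple comes back as a list.

```python
import json

# Hypothetical nested record, made up for this example.
record = {
    'gene': 'geneA',
    'expression': [0.0, 1.1, 2.2],
    'annotations': {'chromosome': 1, 'reviewed': True, 'alias': None},
    'position': (100, 200),
}

# Serialize to human-readable text and parse it back.
text = json.dumps(record, indent=2)
restored = json.loads(text)

print(text)
print(type(restored['position']))
```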
Feather
- Type: Binary format
- Packages needed: pandas, pyarrow
- Space efficiency: Good
- Good for sharing/archival: No
- Tidy data:
  - Speed: Great
  - Ease of use: Good
- Array data:
  - Speed: -
  - Ease of use: -
- Best use cases: Temporary storage of tidy data.
Feather is a file format for storing data frames quickly. There are libraries for Python, R and Julia.
Parquet
- Type: Binary format
- Packages needed: pandas, pyarrow
- Space efficiency: Great
- Good for sharing/archival: Yes
- Tidy data:
  - Speed: Good
  - Ease of use: Great
- Array data:
  - Speed: Good
  - Ease of use: It’s complicated
- Best use cases: Working with big datasets in tidy data format.
Parquet is a standardized open-source columnar storage format with implementations for many languages (C, Java, Python, MATLAB, Julia, etc.).
HDF5
- Type: Binary format
- Packages needed: pandas, PyTables, h5py
- Space efficiency: Good for numeric data.
- Good for sharing/archival: Yes, if datasets are named well.
- Tidy data:
  - Speed: Ok
  - Ease of use: Good
- Array data:
  - Speed: Great
  - Ease of use: Good
- Best use cases: Working with big datasets in array data format.
HDF5 is a high-performance storage format for storing large amounts of data in multiple datasets in a single file.
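A minimal sketch of storing several named arrays in one HDF5 file, assuming h5py is installed. The dataset names `data_array` and `vector` and the file name are made up for this example.

```python
import os
import tempfile

import h5py
import numpy as np

data_array = np.arange(24, dtype=np.float64).reshape(2, 4, 3)
vector = np.array([7, 2, 9, 10])

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'data.h5')

    # Store two named datasets in the same file.
    with h5py.File(path, 'w') as f:
        f.create_dataset('data_array', data=data_array)
        f.create_dataset('vector', data=vector)

    # List the stored datasets and read one back into memory.
    with h5py.File(path, 'r') as f:
        names = sorted(f.keys())
        restored = f['data_array'][()]

print(names)
print(restored.shape, restored.dtype)
```

Descriptive dataset names matter here, since they are what a later reader of the file sees first.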
Pickle
- Type: Binary format
- Packages needed: None (pickle-module is included with Python).
- Space efficiency: Ok.
- Good for sharing/archival: No! See warning below.
- Tidy data:
  - Speed: Ok
  - Ease of use: Ok
- Array data:
  - Speed: Ok
  - Ease of use: Ok
- Best use cases: Saving Python objects for debugging.
Pickle is Python’s own serialization library. It allows you to store Python objects in a binary file.
Attention: Loading pickles from untrusted sources is risky, as they can contain arbitrary code that is executed when the pickle is loaded.
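A minimal sketch of a pickle round trip with the built-in pickle module, using an invented object and an in-memory byte string instead of a file. Unlike JSON, Python-specific types such as tuples and sets survive the round trip.

```python
import pickle

# Hypothetical object, made up for this example.
record = {
    'name': 'analysis-1',
    'values': (0.0, 1.1, 2.2),
    'tags': {'a', 'b'},
}

# Serialize to bytes and restore the object.
payload = pickle.dumps(record)

# Only unpickle data that you created yourself or otherwise trust.
restored = pickle.loads(payload)

print(type(payload))
print(restored == record)
```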
Excel
- Type: Binary format
- Packages needed: openpyxl
- Space efficiency: Bad.
- Good for sharing/archival: Maybe.
- Tidy data:
  - Speed: Bad
  - Ease of use: Good
- Array data:
  - Speed: Bad
  - Ease of use: Ok
- Best use cases: Sharing data in many fields. Quick data analysis. Manual data entry.
Excel is very popular in social sciences and economics. However, it is not a good format for data science. See:
https://www.kristianbrock.com/post/send-me-data/
Text formats
- pros:
  - human readable
  - easy sharing
- cons:
  - poor performance
  - high space usage
Binary formats
- pros:
  - can represent floating point numbers with full precision
  - save space
  - good reading and writing performance
  - multiple datasets in one file
  - working with large data
- cons:
  - not all are good for sharing
  - not human readable
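The precision point can be illustrated with a small comparison. This is a sketch under stated assumptions: the text side uses `np.savetxt` with a deliberately short format string (`'%.6f'`, chosen here), and the binary side uses NumPy's own `.npy` format, which is not one of the formats discussed above. Both write to in-memory buffers.

```python
import io

import numpy as np

values = np.array([1 / 3, 2 / 3, np.pi])

# Text: with a short format string, digits beyond the sixth decimal are dropped.
text_buffer = io.StringIO()
np.savetxt(text_buffer, values, fmt='%.6f')
text_buffer.seek(0)
from_text = np.loadtxt(text_buffer)

# Binary: the .npy format stores the bytes of each float64 unchanged.
binary_buffer = io.BytesIO()
np.save(binary_buffer, values)
binary_buffer.seek(0)
from_binary = np.load(binary_buffer)

print(np.array_equal(from_text, values))
print(np.array_equal(from_binary, values))
```

A text format can keep full precision if enough digits are written, but that costs more space per number.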
Links and references