In many applications, datasets are becoming so large and complex that traditional data processing applications are inadequate. The challenges include capture, storage, curation, search, sharing, transfer, querying, analysis, visualization, and information privacy.

Course Content

This course gives an introduction to data frames, which are containers for large datasets. It gives an overview of how to import and manipulate datasets and how to perform various statistical estimations, with a focus on interpreting the datasets properly.


The course requires familiarity with the NumPy, SciPy, and Matplotlib libraries, introduced here.

Duration: 2-6 hours

Tools Introduced


Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.
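As a rough illustration of what "numerical tables and time series" can look like in practice, here is a minimal sketch. The city names and temperature values are invented for this example and are not course data.

```python
import pandas as pd

# Hypothetical daily temperature readings, invented for illustration.
dates = pd.date_range("2024-01-01", periods=6, freq="D")
df = pd.DataFrame(
    {
        "city": ["Oslo", "Bergen"] * 3,
        "temp_c": [-3.0, 2.0, -1.0, 4.0, 1.0, 3.0],
    },
    index=dates,
)

# Table-style operation: mean temperature per city.
mean_by_city = df.groupby("city")["temp_c"].mean()

# Time-series-style operation: mean over consecutive 2-day windows.
two_day_mean = df["temp_c"].resample("2D").mean()

print(mean_by_city)
print(two_day_mean)
```

The same data frame supports both views: grouping by a column treats it as a table, while resampling uses the date index to treat it as a time series.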


Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics is available for different types of data and for each estimator.
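To give a feel for estimating a model and reading its result statistics, the sketch below fits an ordinary least squares line to synthetic data. The data-generating line (intercept 2, slope 3) and the noise level are assumptions chosen for this example, and ordinary least squares is only one of the estimators the module provides.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (an assumption for this sketch): y = 2 + 3x + noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=x.size)
data = pd.DataFrame({"x": x, "y": y})

# Estimate a linear model with ordinary least squares.
results = smf.ols("y ~ x", data=data).fit()

# Estimated coefficients, indexed by "Intercept" and "x".
print(results.params)

# Full table of result statistics (standard errors, p-values, R-squared, ...).
print(results.summary())
```

The estimated coefficients should land close to the values used to generate the data, and the summary table is a typical starting point for interpreting a fit.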