PhD Course on Processing Big Data, Feb 2018
Instructor: Claudia Soares, PhD Program in ECE @ Tecnico
Announcements
The slides covering material up to dimensionality reduction and the homework have been posted. The homework due date is now May 8.
New slides posted.
Due to health issues, today's class (Mar 9) will be postponed. We will meet as scheduled on Tue, and we will then arrange an alternative date for the missed class.
I have posted the slides for lectures 1 and 2.
On Fri, March 9, our classroom will be occupied, so we will exceptionally meet at 14:00 in room 5.09, North Tower.
Classes will be held Tue and Fri at 11:00 in room 4.12, North Tower.
Why Processing Big Data?
Nowadays, data are generated in many ways, all the time and everywhere: our
online activity, medical records, purchase and travel records,
financial data. The data flows are now larger than the world’s storage
capacity; they are heterogeneous, noisy, incomplete, yet very
useful.
This course provides frameworks and tools to find the stories
behind this deluge of data.
Data are big.
And so learning depends critically on algorithms whose
running time grows linearly with the size of the data.
Data are messy.
They come heterogeneous, noisy, and with missing entries. And so learning needs
to be robust. We will go through tools like
But more importantly, we will put the algorithms to work during the
course, building up to the final Big Data project.
Prerequisites
The course is open to all PhD students familiar with Linear Algebra,
matrix notation, and some high-level programming language (such as Python, R,
Julia, or MATLAB). Familiarity with probability theory will also make
your journey smoother through the semester.
Lecture slides
Introduction to learning
Exploratory Data Analysis
Generalization theory
Principal Component Analysis: A Linear Algebra approach
Probabilistic PCA
Compressed Sensing
Matrix Sketching
Kernel PCA
ISOMAP
Hierarchical Clustering
Assignment Clustering
Spectral Clustering (I and II)
Processing large scale heterogeneous data: Generalized low rank models
Graphical models
Graphical models for heterogeneous data streams
(Tentative) Syllabus
Introduction
Exploratory Data Analysis
Generalization of a learned hypothesis
Limitations of predictive modeling
Big data: machine learning for massive datasets
Dimensionality reduction
Compressed sensing and sparse recovery;
Clustering;
Regression;
Online learning.
Big messy data: Generalized low rank models
Missing data problem
PCA;
Regularized PCA and solution methods;
Generalized regularization and solution methods;
Matrix completion for big data;
Choosing low rank models;
Fitting low rank models;
Applications.
Heterogeneous nature of data
Generalized loss functions;
Loss functions for abstract data types;
Multidimensional loss functions;
Applications.
Big data flow processing: graph signal processing
Introduction to graphs and their matrices;
Signal variation on a graph and frequency;
Graph filtering;
IIR, FIR filtering on a graph;
Applications.
Grading
Homework (30%), project (45%), final take-home exam (20%), participation (5%)
References
No single text covers the entirety of the course. The following books
will be used in part and complemented with recent papers.
K. Murphy, “Machine Learning: A Probabilistic Perspective”, MIT Press
T. Hastie, R. Tibshirani, J. Friedman, “The Elements of Statistical Learning”, 2nd edition, 2009
M. Udell, C. Horn, R. Zadeh, and S. Boyd, “Generalized low rank models,” Foundations and Trends in Machine Learning, 2016.
Y. S. Abu-Mostafa, M. Magdon-Ismail, and H.-T. Lin, “Learning from Data,” Vol. 4. Singapore: AMLBook, 2012.
