CSU2016122 Big Data Analysis - tools and methods

Volume 2015/2016
Content

See the course description on the campaign page for Copenhagen Summer University

This course will bring you to the forefront of the newest tools and methods, based on cutting-edge research and experience. Big Data is omnipresent, from industry to government, and is frequently considered a completely new approach to problem solving. While the possibilities are often exaggerated, Big Data does indeed introduce new opportunities and challenges. The ability to analyze and combine large datasets from different sources has obvious applications. Nonetheless, the lack of quality in the data, combined with high entropy, means that conventional analysis often fails.

What you will learn
By completing the course you will be able to set up a basic Big Data Analysis end to end: from retrieving and cleaning the data, through establishing the information level, extracting patterns and finding outliers, to curating the necessary data. Furthermore, you will become acquainted with a number of advanced tools.

Course Content
We will use two datasets consistently throughout the course, one using structured data and one using unstructured data. The data will be used to demonstrate the different steps in Big Data Analysis.

The course contains the following methods and tools:

Core elements:

  • Data cleaning. Detecting and correcting (or removing) corrupt or inaccurate records.
  • Statistical methods for very large datasets. Robust methods for very large datasets and data with very large variance.
  • Finding patterns and outliers in Big Data. Which methods can be used to identify sparse patterns in very large datasets, and how to identify data that does not follow the overall pattern for a dataset.
  • Deep Learning. Machine learning methods especially focused on patterns and classification in image-based datasets.
  • Systems for Big Data Analysis. Common systems for BDA (Hadoop, PyDisco, etc.) and hardware system design for efficient BDA.
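
As a rough illustration of the robust statistics and outlier items above, the following minimal Python sketch flags values that lie far from the median, measured in units of the median absolute deviation (MAD). It is not course material: the function name mad_outliers, the threshold of 3.5 and the sample readings are all assumptions chosen for the example.

```python
import statistics


def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold.

    The median and the MAD are used instead of the mean and standard
    deviation because a few extreme values barely move them.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # More than half of the values are identical; the score is undefined.
        return []
    # 0.6745 rescales the MAD so the score is comparable to an ordinary
    # z-score when the bulk of the data is roughly normal.
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical sensor readings with one corrupt record.
readings = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 55.0]
print(mad_outliers(readings))  # prints [55.0]
```

On these readings the mean is pulled to roughly 16.4 by the single corrupt record, while the median stays at 10.0. That difference is why robust estimators are preferred for data with very large variance.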

Other tools/methods (emphasized depending on participants' interests):

  • Selected machine learning algorithms for large-scale data.
  • Random forests.
  • Large-scale exact nearest neighbor search.
  • Data curation. How to select data for long-term curation, and systems, techniques and standards for data curation.
  • Search Engines and Recommender Systems. The state of the art in ranking models, used by search engines and recommender systems worldwide, is based on probabilistic language models. We will cover the basic principles of these models and provide a tutorial on how to use them on the award-winning Indri information retrieval platform.
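
To give an idea of what ranking with a probabilistic language model involves, here is a minimal sketch of query-likelihood scoring with Dirichlet smoothing in plain Python. It does not use Indri or its API. The function name query_likelihood, the toy documents and the smoothing value mu=10.0 are assumptions made for illustration only.

```python
import math
from collections import Counter


def query_likelihood(query, doc, collection_counts, collection_len, mu=10.0):
    """Log-probability of the query under a Dirichlet-smoothed document model."""
    doc_counts = Counter(doc)
    doc_len = len(doc)
    log_prob = 0.0
    for term in query:
        # Background probability of the term in the whole collection.
        p_coll = collection_counts.get(term, 0) / collection_len
        if p_coll == 0:
            continue  # terms unseen in the collection are ignored
        # Smoothing mixes document evidence with the background model,
        # so a document is not ruled out by one missing query term.
        p_term = (doc_counts[term] + mu * p_coll) / (doc_len + mu)
        log_prob += math.log(p_term)
    return log_prob


# Hypothetical two-document collection.
docs = {
    "d1": "big data analysis with robust statistics".split(),
    "d2": "deep learning for image classification".split(),
}
collection_counts = Counter(t for tokens in docs.values() for t in tokens)
collection_len = sum(collection_counts.values())

query = "robust data analysis".split()
ranked = sorted(
    docs,
    key=lambda d: query_likelihood(query, docs[d], collection_counts, collection_len),
    reverse=True,
)
print(ranked)  # prints ['d1', 'd2']
```

Documents are ranked by how likely their smoothed language model is to generate the query. Here d1 contains all three query terms and therefore scores above d2.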

We will be working with several programming tools; however, all the techniques covered are easily implemented in any standard data-analysis language: R, Python, Matlab, etc.

Participants
The course is aimed at people who are already acquainted with data analysis. The course is strictly focused on Big Data Analysis, so a background in statistics and/or conventional data analysis is assumed. Participants must hold at least a relevant Bachelor's degree and/or have several years of data analysis experience.

Course dates
5 days, 22 – 26 August 2016, 9:00 – 16:30 at the University of Copenhagen, Frederiksberg Campus.

Course directors
Troels C. Petersen, Associate Professor, Niels Bohr Institute, University of Copenhagen
Christian Igel, Professor, Dr. habil., Department of Computer Science, University of Copenhagen

Other course teachers
Christina Lioma, Associate Professor, The Image Section, Department of Computer Science, University of Copenhagen
Joachim Mathiesen, Associate Professor, Biocomplexity, Niels Bohr Institute, University of Copenhagen
Brian Vinter, Professor, eScience, Niels Bohr Institute, University of Copenhagen

Course fee
EUR 2,600/DKK 19,000 excl. Danish VAT. Fee includes teaching, course materials and all meals during the course.

Learning Outcome

See "What you will learn"

See "Course content"
  • Class Instruction: 35 hours
  • Preparation: 10 hours
  • Total: 45 hours
Credit
0 ECTS
Type of assessment
Course participation
None
Marking scale
Without assessment
Censorship form
No external censorship
Criteria for exam assessment

None