1082007


Course
Obtaining data from the Internet: Data crawling for management research

Faculty
Associate Professor, Jörg Claussen

Course Coordinator
Associate Professor, Jörg Claussen

Prerequisites
NOTE: In view of the Corona crisis and the physical limitations that we all face, the Department of Strategy and Innovation has decided to move this course online. Given the special circumstances, the course will be offered without a course fee.

Participants should have a basic knowledge of a programming language (e.g. Python, R, PHP or Java). For participants without such knowledge, there are multiple resources available on the web that they can consult prior to the workshop. Examples will be given mostly in Python.

The workshop addresses the needs both of beginners to the field of data crawling and of more experienced users of crawling methods. Through the many best-practice recommendations on how to execute data crawling projects that will be shared, the course format ensures that all participants gain and apply new knowledge.

Aim
An increasing number of papers published in the fields of business and economics rely on data collected from the Internet. Learning the skills required to successfully develop data crawling projects can open many interesting research opportunities for PhD students. While a first basic crawler can be written in a couple of lines of code, many pitfalls and difficulties can make the successful completion of crawling projects challenging and time-consuming.

This course aims to facilitate acquiring these skills. The goal is to provide a good understanding of the possibilities of crawling, while also giving participants enough time to work on their own projects in a hands-on approach in which they actively solve exercises that are directly connected to their own research interests. Examples will include applications in the fields of innovation and strategy.

Course content
This course introduces PhD students to some of the key tools, frequent complications, and tips and tricks of web crawling.

The following topics will be covered:

When is crawling useful?
Before starting a crawling project, it is important to carefully assess whether crawling is really the appropriate tool to get the desired data. We discuss topics like data size, the structure of the data, legal aspects, and technical countermeasures by website owners.
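As a small illustration of such a technical and legal check, the following minimal Python sketch (standard library only; the URLs are placeholders) asks a site's robots.txt whether a page may be crawled at all:

    # Minimal sketch: check whether a site's robots.txt allows crawling a page.
    # The URLs below are placeholders; replace them with the site you want to assess.
    import urllib.robotparser

    robots = urllib.robotparser.RobotFileParser()
    robots.set_url("https://www.example.com/robots.txt")
    robots.read()

    # can_fetch() reports whether the given user agent may request the page
    allowed = robots.can_fetch("*", "https://www.example.com/some-page")
    print("Crawling allowed:", allowed)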

How to determine the observations to download?
One key challenge for many crawling projects is that there is no readily available list of all sub-sites of a domain that should be included in the crawling process. We consider different ways to get around this problem, including the use of site maps, APIs, continuous IDs, or snowball approaches.
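A minimal sketch of the two simplest of these approaches might look as follows; the site map URL and the ID-based URL pattern are hypothetical examples, not taken from any specific project:

    # Minimal sketch: collect candidate URLs from a site map (hypothetical URL)
    # and, alternatively, construct them from a range of continuous numeric IDs.
    import requests
    import xml.etree.ElementTree as ET

    # Option 1: read the site map, if the site provides one
    sitemap = requests.get("https://www.example.com/sitemap.xml", timeout=10)
    root = ET.fromstring(sitemap.content)
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    urls_from_sitemap = [loc.text for loc in root.findall(".//sm:loc", ns)]

    # Option 2: build URLs from continuous IDs (assumed URL pattern)
    urls_from_ids = [f"https://www.example.com/item/{i}" for i in range(1, 1001)]

    print(len(urls_from_sitemap), "URLs from site map,", len(urls_from_ids), "from ID range")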

How to do the actual crawling?
A first important consideration when setting up a crawling process is whether it should be set up as a one-off task or as a repeated process in which the same websites are regularly visited to create a panel. We then look at how crawling actually works by programming first simple crawlers in Python, but also address more advanced topics like running multiple instances of the crawler in parallel.
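A minimal one-off crawler, run first sequentially and then with several instances in parallel, could look like the following sketch (the URLs are placeholders and the politeness delay is only indicative):

    # Minimal sketch: download a list of pages, first sequentially, then in
    # parallel threads. URLs are placeholders for a real crawling project.
    import time
    import requests
    from concurrent.futures import ThreadPoolExecutor

    urls = [f"https://www.example.com/item/{i}" for i in range(1, 21)]

    def fetch(url):
        # A short pause between requests keeps the load on the server low
        time.sleep(1)
        response = requests.get(url, timeout=10)
        return url, response.status_code, response.text

    # Sequential crawl
    pages = [fetch(url) for url in urls]

    # Parallel crawl with a small number of worker threads
    with ThreadPoolExecutor(max_workers=4) as pool:
        pages = list(pool.map(fetch, urls))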

How to extract content?
The raw HTML code that is downloaded from the webserver can usually not be used directly. Before the acquired data can be used, one first has to extract ("parse") the desired information from the raw data. Depending on the complexity of the project, parsing is done either on the fly while crawling or as a separate process. We introduce the concept of regular expressions and a parsing framework and show how they can be used to identify structured information within a website.
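The following sketch illustrates both ideas on a made-up HTML snippet; BeautifulSoup is used here as one common example of a parsing framework, not as the only option covered in the course:

    # Minimal sketch: extract structured information from raw HTML with a
    # regular expression and with a parsing framework (BeautifulSoup).
    # The HTML snippet and field names are invented for illustration.
    import re
    from bs4 import BeautifulSoup

    html = """
    <html><body>
      <h1 class="title">Example product</h1>
      <span class="price">EUR 19.99</span>
    </body></html>
    """

    # Regular expression: pull the numeric price out of the price tag
    price = re.search(r'<span class="price">EUR ([\d.]+)</span>', html).group(1)

    # Parsing framework: navigate the document tree instead of matching text
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1", class_="title").get_text(strip=True)

    print(title, price)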

How to process the acquired data?
Depending on the time dimension of the crawling process and the size of the crawl, managing the data and converting it to a format that can be used by statistical packages can be challenging. For bigger projects it can be useful to store the data directly in a relational database, while for smaller projects it is usually fine to save it to flat text files that can be imported into statistical packages.
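As a small illustration, the following sketch stores the same made-up observations both in a flat CSV file and in an SQLite database (all field and file names are hypothetical):

    # Minimal sketch: store parsed observations either in a flat CSV file or in
    # a relational (SQLite) database; the data and names are made up.
    import csv
    import sqlite3

    observations = [
        {"id": 1, "title": "Example product", "price": 19.99},
        {"id": 2, "title": "Another product", "price": 4.50},
    ]

    # Flat text file that statistical packages can import directly
    with open("observations.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "title", "price"])
        writer.writeheader()
        writer.writerows(observations)

    # Relational database for larger or repeated crawls
    con = sqlite3.connect("crawl.db")
    con.execute("CREATE TABLE IF NOT EXISTS obs (id INTEGER, title TEXT, price REAL)")
    con.executemany("INSERT INTO obs VALUES (:id, :title, :price)", observations)
    con.commit()
    con.close()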

Teaching style
The course will be run as an active learning workshop in which the lecturer not only introduces the technical side but also shares tips and tricks from practical experience. Students will not only receive the theoretical knowledge for data crawling but will also work hands-on to complete their own data crawling projects.

Lecture plan
Morning sessions 9.00-12.00
Afternoon sessions 13.00-16.00

We use the mornings to present the course content and discuss project progress, and the afternoons to work on the participants' own projects. By the end of the course, participants should have completed their own data crawling project, conducted in a small group with other participants.

Monday 11/5/2020
Morning: Overview of the crawling process, introduction to HTML and Python
Afternoon: Screening of project ideas

Tuesday 12/5/2020
Morning: Presentation of project ideas, introduction to parsing, example of crawling project
Afternoon: Develop code to identify all observations

Wednesday 13/5/2020
Morning: Presentation of project progress, advanced crawling topics
Afternoon: Download and parse data

Thursday 14/5/2020
Morning: Data management with Stata
Afternoon: Complete and run the project code 

Friday 15/5/2020
Morning: Project presentations

Learning objectives
After attending the course, students should be able to:
- Assess when data crawling is useful
- Identify the observations to be downloaded
- Write their own data crawlers
- Extract the content from the crawled data
- Process the acquired data

Exam
Students are expected to participate in all lectures. Students work on their own crawling projects in groups of two or three. The exam consists of handing in the project and presenting it orally on 15/5/2020.

Other
Please bring a laptop to class and install the Python 3.7 version of Anaconda (available at https://www.anaconda.com/distribution/) and Stata.

Start date
11/05/2020

End date
15/05/2020

Level
PhD

ECTS
3

Language
English

Course Literature
• Claussen J. & Peukert C. (2019): Obtaining Data from the Internet: A Guide to Data Crawling in Management Research. Available at SSRN: https://ssrn.com/abstract=3403799

Fee
DKK 3,900

Minimum number of participants
8

Maximum number of participants
20

Time
Morning session: 9.00-12.00
Afternoon session: 13.00-16.00

Location
Copenhagen Business School
2000 Frederiksberg



Contact information
For questions about the content of the course, please contact Jörg Claussen - jcl.si@cbs.dk

For questions about the administration of the course, please contact Nina Iversen - ni.research@cbs.dk

Registration deadline
29/04/2020

Please note that registration is binding after the registration deadline.