|
|
|
|
1082007
|
|
|
|
Course |
Obtaining data from the Internet: Data crawling for management research
|
Faculty |
Associate Professor, Jörg Claussen
|
Course Coordinator |
Associate Professor, Jörg Claussen
|
Prerequisites |
NOTE: In view of the Corona crisis and the physical limitations that we all face, the Department of Strategy and Innovation has decided to move online with this course. In view of the special circumstances, the course will be offered without a course fee.
Participants should have a basic knowledge of a programming language (e.g. Python, R, PHP or Java). For participants without such knowledge, there are multiple resources available on the web that they can consult prior to the workshop. Examples will be given mostly in Python.
The workshop aims to address the needs both of beginners to the field of data crawling and of more experienced users of crawling methods. Through the many best-practice recommendations on how to execute data crawling projects that will be shared, the course format ensures that all participants will gain and apply new knowledge.
|
Aim |
An increasing number of papers published in the fields of business and economics rely on data collected from the Internet. Learning the skills required to successfully develop data crawling projects can open many interesting research opportunities for PhD students. While writing a first basic crawler can be done in a couple of lines of code, many pitfalls and difficulties can make the successful completion of crawling projects challenging and time-consuming.
This course facilitates acquiring these skills. Its goal is to provide a good understanding of the possibilities of crawling, while also giving participants enough time to work on their own project in a hands-on approach, actively solving exercises that are directly connected to their own research interests. Examples will include applications in the field of innovation and strategy.
|
Course content |
This course introduces PhD students to some of the key tools, frequent complications, and tips and tricks of web crawling.
The following topics will be covered:
When is crawling useful? Before starting a crawling project, it is important to carefully assess if crawling is really the appropriate tool to get the desired data. We discuss topics like data size, structure of the data, legal aspects, and technical countermeasures from website owners.
How to determine the observations to download? One key challenge for many crawling projects is that there is no readily available list of all sub-sites of a domain that should be included in the crawling process. We consider different ways to get around this problem, including the use of site maps, APIs, continuous IDs, or snowball approaches.
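For the continuous-ID approach mentioned above, a minimal sketch looks as follows. The base URL is a hypothetical placeholder, not a real target site:

```python
# Sketch: enumerating candidate pages via a continuous numeric ID range.
# "https://example.com/item/" is a hypothetical example URL.

def candidate_urls(base="https://example.com/item/", start=1, stop=5):
    """Build a list of candidate sub-page URLs from a continuous ID range."""
    return [f"{base}{i}" for i in range(start, stop + 1)]

urls = candidate_urls()
print(urls[0])    # https://example.com/item/1
print(len(urls))  # 5
```

In a real project, the upper bound of the ID range is typically unknown in advance and has to be probed, e.g. by checking when requests start returning 404 errors.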
How to do the actual crawling? A first important consideration when setting up a crawling process is whether it should be set up as a one-off task or as a repeated process in which the same websites are visited regularly to create a panel. We then look at how crawling actually works by programming first simple crawlers in Python, but also address more advanced topics like running multiple instances of the crawler in parallel.
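The core loop of a simple crawler can be sketched as below. The fetch function is injected so the demo runs offline with a stub; a real crawler would pass in something like `lambda u: requests.get(u).text` instead:

```python
import time

def crawl(urls, fetch, delay=0.0):
    """Visit each URL with a polite delay between requests; return {url: html}.
    `fetch` is injected so this sketch runs without a live HTTP call."""
    results = {}
    for url in urls:
        results[url] = fetch(url)
        time.sleep(delay)  # throttle to avoid overloading the server
    return results

# Offline demo with a stub fetcher instead of real network access:
pages = crawl(["https://example.com/a", "https://example.com/b"],
              fetch=lambda u: f"<html>{u}</html>")
print(len(pages))  # 2
```

Separating the fetching logic also makes it straightforward to later run several crawler instances in parallel, e.g. by splitting the URL list across processes.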
How to extract content? The raw HTML code downloaded from the webserver usually cannot be used directly. Before the acquired data can be used, one first has to extract (“parse”) the desired information from the raw data. Depending on the complexity of the project, parsing is done either on-the-fly while crawling or as a separate process. We introduce the concept of regular expressions and a parsing framework and show how they can be used to identify structured information within a website.
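As a minimal illustration of regular-expression parsing, consider a hypothetical HTML snippet as it might come back from a crawler (the tag structure and field names are invented for this example):

```python
import re

# Hypothetical raw HTML returned by the webserver:
html = '<div class="product"><h2>Widget</h2><span class="price">19.99</span></div>'

# Regular expressions pull structured fields out of the raw markup.
name = re.search(r"<h2>(.*?)</h2>", html).group(1)
price = float(re.search(r'class="price">([\d.]+)<', html).group(1))
print(name, price)  # Widget 19.99
```

For anything beyond simple patterns, a dedicated parsing framework is usually more robust than hand-written regular expressions, since it tolerates variations in the markup.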
How to process the acquired data? Depending on the time dimension of the crawling process and the size of the crawl, managing the data and converting it to a format that can be used by statistical packages can be challenging. For bigger projects it might be useful to store the data directly in a relational database, while for smaller projects it might be fine to save to flat text files that can be imported into statistical packages.
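Storing crawled records in a relational database can be sketched with Python's built-in sqlite3 module. The table layout and URLs here are illustrative; a real project would use a file-backed database rather than an in-memory one:

```python
import sqlite3

# In-memory database for illustration; a real project would pass a file path.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, title TEXT)")
con.executemany("INSERT INTO pages VALUES (?, ?)",
                [("https://example.com/1", "First"),
                 ("https://example.com/2", "Second")])
con.commit()
rows = con.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
print(rows)  # 2
```

Using the URL as the primary key also protects against storing duplicate observations when a crawl is restarted.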
|
Teaching style |
The course will be run as an active learning workshop, where the lecturer will not only introduce the technical side, but will also share tips and tricks from practical experience. Students will not only receive the theoretical knowledge for data crawling but will also work hands-on on completing their own data crawling projects.
|
Lecture plan |
Morning sessions: 9.00-12.00. Afternoon sessions: 13.00-16.00.
We use the mornings to present the course content and discuss project progress, and the afternoons to work on the participants' own projects. By the end of the course, participants should have completed their own data crawling project, conducted in a small group with other participants.
Monday 11/5/2020 Morning: Overview of the crawling process, introduction to HTML and Python Afternoon: Screening of project ideas
Tuesday 12/5/2020 Morning: Presentation of project ideas, introduction to parsing, example of crawling project Afternoon: Develop code to identify all observations
Wednesday 13/5/2020 Morning: Presentation of project progress, advanced crawling topics Afternoon: Download and parse data
Thursday 14/5/2020 Morning: Data management with Stata Afternoon: Complete and run the project code
Friday 15/5/2020 Morning: Project presentations
|
Learning objectives |
Subsequent to attending the course, the student should be ready to:
- Assess when data crawling is useful
- Identify the observations to be downloaded
- Write their own data crawlers
- Extract the content from the crawled data
- Process the acquired data
|
Exam |
Students are expected to participate in all lectures. Students work on their own crawling projects in groups of two or three. The exam consists of the hand-in of the project and an oral presentation of the project on 15/5/2020.
|
Other |
Please bring a laptop to class and install the Python 3.7 version of Anaconda (available at https://www.anaconda.com/distribution/) and Stata.
|
Start date |
11/05/2020
|
End date |
15/05/2020
|
Level |
PhD
|
ECTS |
3
|
Language |
English
|
Course Literature |
• Claussen J. & Peukert C. (2019): Obtaining Data from the Internet: A Guide to Data Crawling in Management Research. Available at SSRN: https://ssrn.com/abstract=3403799
|
Fee |
DKK 3,900
|
Minimum number of participants |
8
|
Maximum number of participants |
20
|
Location |
Time:
Morning session: 9-12. Afternoon session: 13-16.
Location: Copenhagen Business School, 2000 Frederiksberg
|
Contact information |
For the content of the course please contact Jörg Claussen - jcl.si@cbs.dk
For the administration of the course please contact Nina Iversen - ni.research@cbs.dk
|
Registration deadline |
29/04/2020
|
|
Please notice that the registration is binding after the registration deadline.
|