🏫 University Degree Web Scraper

January 2025 · Python · BeautifulSoup · Data Pipeline

🔍 Project Overview

This project is a Python web scraper that collects undergraduate course information from Australian university websites. It automates the retrieval of ATAR cut-offs, course titles, codes, faculties, locations, and links to course pages, so that courses can be compared and analysed quickly for education consulting.

⚙️ Methodology

Target Sites: Undergraduate degree listings from universities such as UQ, QUT, Griffith, UNSW, and USYD.
Scraping Logic: Used requests to fetch pages and BeautifulSoup to parse the HTML. Extracted course blocks and kept only undergraduate degree programs, skipping all other offerings.
Data Structuring: Normalised course information into structured dictionaries, handled nested HTML elements, and cleaned inconsistent formatting (e.g., whitespace, abbreviations).
Output: Exported the full dataset to CSV for integration with Airtable and an Excel ATAR calculator pipeline.
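
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the sample HTML, the class names (course-card, course-title, course-code, atar, course-link), the data-level attribute, the example.edu base URL, and every function name are assumptions made for the example. The fetch step with requests is left out so the sketch runs offline.

```python
import csv
import io
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup: real university pages differ and change often.
SAMPLE_HTML = """
<div class="course-card" data-level="undergraduate">
  <h3 class="course-title">  Bachelor of   Science </h3>
  <span class="course-code">2030</span>
  <span class="atar">ATAR: 80.00</span>
  <a class="course-link" href="/study/bachelor-of-science">Details</a>
</div>
<div class="course-card" data-level="postgraduate">
  <h3 class="course-title">Master of Data Science</h3>
  <span class="course-code">5660</span>
  <a class="course-link" href="/study/master-of-data-science">Details</a>
</div>
"""

FIELDS = ["title", "code", "atar", "url"]


def clean(text):
    """Collapse runs of whitespace and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def text_of(block, selector):
    """Return cleaned text of the first element matching selector, or ''."""
    node = block.select_one(selector)
    return clean(node.get_text()) if node else ""


def parse_courses(html, base_url):
    """Turn each undergraduate course block into a flat dictionary."""
    soup = BeautifulSoup(html, "html.parser")
    courses = []
    for block in soup.select("div.course-card"):
        if block.get("data-level") != "undergraduate":
            continue  # skip non-undergraduate offerings
        # Pull the first number out of text such as "ATAR: 80.00".
        atar_match = re.search(r"\d+(?:\.\d+)?", text_of(block, ".atar"))
        link = block.select_one("a.course-link")
        courses.append({
            "title": text_of(block, ".course-title"),
            "code": text_of(block, ".course-code"),
            "atar": float(atar_match.group()) if atar_match else None,
            "url": urljoin(base_url, link["href"]) if link else "",
        })
    return courses


def to_csv(courses):
    """Serialise the course dictionaries to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(courses)
    return buffer.getvalue()


courses = parse_courses(SAMPLE_HTML, "https://example.edu")
print(to_csv(courses))
```

In a live run, the html argument would come from a fetched page (for example the text of a requests response), and each site would likely need its own selectors.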

💡 Features

Accurate scraping of course title, ATAR, faculty, degree length, location, and web link
Handles inconsistent markup across different university websites
Batch scraping and CSV export functionality
Designed for integration with downstream consulting tools

📉 Errors & Fixes

Frequent changes in HTML structure – added CSS selector fallbacks and error logging to adapt quickly
Redirects and duplicate entries – resolved by filtering URL patterns and caching visited links
Encoding issues in CSV output – enforced UTF-8 encoding and validated with Excel
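
A rough sketch of how these three fixes might look in code is given below. It is illustrative only: the selector list, the /study/ path prefix, the logger name, and all function names are assumptions rather than the project's real implementation. The utf-8-sig codec is also an assumption about how UTF-8 was enforced; it writes a byte order mark, which helps Excel detect the encoding.

```python
import csv
import io
import logging
from urllib.parse import urlsplit, urlunsplit

from bs4 import BeautifulSoup

logger = logging.getLogger("scraper")  # hypothetical logger name

# Hypothetical selectors, ordered from the current layout to older ones.
ATAR_SELECTORS = ["span.atar", "td.entry-score", "[data-field='atar']"]


def select_with_fallback(block, selectors):
    """Return text from the first selector that matches, else log and return None."""
    for selector in selectors:
        node = block.select_one(selector)
        if node is not None:
            return node.get_text(strip=True)
    logger.warning("No selector matched; tried %s", selectors)
    return None


def canonical_url(url):
    """Lower-case scheme and host; drop query, fragment, and trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def unique_course_urls(urls, path_prefix="/study/"):
    """Keep the first occurrence of each course URL, preserving order."""
    visited = set()  # cache of canonical URLs already seen
    kept = []
    for url in urls:
        canon = canonical_url(url)
        if not urlsplit(canon).path.startswith(path_prefix):
            continue  # URL pattern filter: not a course page
        if canon in visited:
            continue  # duplicate of an earlier link
        visited.add(canon)
        kept.append(canon)
    return kept


def csv_bytes(rows, fieldnames):
    """Encode rows as UTF-8 CSV bytes with a BOM so Excel detects the encoding."""
    text = io.StringIO(newline="")
    writer = csv.DictWriter(text, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return text.getvalue().encode("utf-8-sig")


# Usage with made-up inputs.
old_layout = BeautifulSoup(
    "<table><tr><td class='entry-score'>85.00</td></tr></table>", "html.parser"
)
print(select_with_fallback(old_layout, ATAR_SELECTORS))

links = [
    "https://Example.edu/study/bachelor-of-arts/",
    "https://example.edu/study/bachelor-of-arts?ref=home",
    "https://example.edu/news/open-day",
    "https://example.edu/study/bachelor-of-science#fees",
]
print(unique_course_urls(links))
```

For redirects specifically, the final URL reported by the HTTP client after following the redirect could be passed through canonical_url before the visited check, so that two links resolving to the same page are counted once.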

📈 Key Takeaways

Web scraping depends heavily on site stability, so robust error handling is essential
Cleaned, structured datasets from raw HTML are valuable for decision support systems
This scraper supports education consultants in giving personalised, accurate advice

🔗 Repository

GitHub Repo: University Degree Web Scraper
