🏫 University Degree Web Scraper
January 2025 · Python · BeautifulSoup · Data Pipeline
🔍 Project Overview
This project involved building a Python-based web scraper to collect undergraduate course information from Australian university websites. The goal was to automate the process of retrieving ATAR cut-offs, course titles, codes, faculties, locations, and links to course pages—allowing quick comparison and analysis for education consulting purposes.
⚙️ Methodology
- Target Sites: Focused on the degree listing pages of universities such as UQ, QUT, Griffith, UNSW, and USYD.
- Scraping Logic: Used requests and BeautifulSoup to locate and parse HTML content. Extracted course blocks, filtered degree programs, and avoided non-undergraduate offerings.
- Data Structuring: Normalised course information into structured dictionaries, handled nested HTML elements, and cleaned inconsistent formatting (e.g., whitespace, abbreviations).
- Output: Exported the full dataset to CSV for integration with Airtable and an Excel ATAR calculator pipeline.
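The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the sample markup, the class names (`course`, `title`, `atar`, `link`), the `data-level` attribute, and the `example.edu` base URL are all assumptions made for the example, since each real university site uses its own structure. The HTML is inlined so the sketch runs without a network request; in practice it would come from something like `requests.get(url).text`.

```python
import csv
import io
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup; real university pages are structured differently.
SAMPLE_HTML = """
<div class="course" data-level="undergraduate">
  <h3 class="title">  Bachelor of   Science </h3>
  <span class="atar">ATAR: 80.00</span>
  <a class="link" href="/courses/bsc">Details</a>
</div>
<div class="course" data-level="postgraduate">
  <h3 class="title">Master of Data Science</h3>
  <span class="atar">N/A</span>
  <a class="link" href="/courses/mds">Details</a>
</div>
"""

FIELDS = ["title", "atar", "url"]


def clean_text(text):
    """Collapse runs of whitespace and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def parse_atar(text):
    """Return the first number found in the text as a float, else None."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None


def parse_courses(html, base_url):
    """Extract undergraduate course blocks into a list of dictionaries."""
    soup = BeautifulSoup(html, "html.parser")
    courses = []
    for block in soup.select("div.course"):
        if block.get("data-level") != "undergraduate":
            continue  # skip non-undergraduate offerings
        courses.append({
            "title": clean_text(block.select_one(".title").get_text()),
            "atar": parse_atar(block.select_one(".atar").get_text()),
            "url": urljoin(base_url, block.select_one("a.link")["href"]),
        })
    return courses


def to_csv(rows):
    """Serialise the course dictionaries to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    rows = parse_courses(SAMPLE_HTML, "https://example.edu/study/")
    print(to_csv(rows))
```

Keeping the cleaning helpers separate from the parsing function means the same normalisation can be reused across sites even when the selectors differ.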
💡 Features
- Accurate scraping of course title, ATAR, faculty, degree length, location, and web link
- Handles inconsistent markup across different university websites
- Batch scraping and CSV export functionality
- Designed for integration with downstream consulting tools
📉 Errors & Fixes
- Frequent changes in HTML structure – added CSS selector fallbacks and error logging to adapt quickly
- Redirects and duplicate entries – resolved by filtering URL patterns and caching visited links
- Encoding issues in CSV output – enforced UTF-8 encoding and validated with Excel
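A hedged sketch of how these three fixes might look in code. The selector list, the URL normalisation rules, and the helper names (`select_first`, `canonical_url`, `unique_urls`, `write_csv`) are hypothetical and chosen only for illustration. Two assumptions are worth flagging: dropping the query string and fragment is one possible URL filter and would be wrong for sites that use query parameters to identify courses, and the `utf-8-sig` codec (UTF-8 with a byte order mark) is one common way to help Excel detect the encoding, not necessarily what the project used.

```python
import csv
import logging
from urllib.parse import urlsplit, urlunsplit

from bs4 import BeautifulSoup

logger = logging.getLogger("scraper")

# Hypothetical selectors, tried in order from most to least specific.
TITLE_SELECTORS = ["h1.course-title", "h1.program__title", "h1"]


def select_first(soup, selectors, field):
    """Return the text of the first matching selector, or None (logged)."""
    for selector in selectors:
        node = soup.select_one(selector)
        if node is not None:
            return node.get_text(strip=True)
    logger.warning("No selector matched for field %r", field)
    return None


def canonical_url(url):
    """Normalise a URL so trivially different links compare equal.

    Assumption: query strings and fragments do not identify a course.
    """
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def unique_urls(urls):
    """Drop links whose canonical form has already been seen, keeping order."""
    seen = set()
    result = []
    for url in urls:
        key = canonical_url(url)
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


def write_csv(path, rows, fieldnames):
    """Write rows as UTF-8 with a BOM so Excel can detect the encoding."""
    with open(path, "w", newline="", encoding="utf-8-sig") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

When pages are fetched with requests, the `url` attribute of the response holds the final address after any redirects, so passing that value through `canonical_url` before checking the visited set should catch duplicates that arrive via different redirecting links.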
📈 Key Takeaways
- Web scraping is highly dependent on site stability—robust error handling is essential
- Cleaned, structured datasets from raw HTML are valuable for decision support systems
- This scraper supports education consultants in giving personalised, accurate advice
🔗 Repository
GitHub Repo: University Degree Web Scraper