Web Scraping - Peraturan.go.id
An automated Python web scraping tool designed to efficiently extract and structure public legal document data from the official Indonesian government website, peraturan.go.id.
This project focuses on data collection, the first step in any data science pipeline. The repository contains a suite of Python scripts that scrape laws, regulations, and other legal documents from peraturan.go.id. The tool is configurable and offers several scraping strategies for different data volumes: single-page scraping, automated pagination handling, and multi-threaded scraping for faster extraction. The output is written to structured CSV files that capture key metadata such as document titles, issuing ministries, context, and direct PDF download links, so the public data is ready for further analysis, text mining, or database integration.
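As a rough illustration of the CSV output described above, the sketch below writes a list of scraped records to CSV with the standard library. The column names (`title`, `ministry`, `context`, `pdf_url`) and the `write_records` helper are assumptions made for this example; the actual scripts may use different headers and cleaning rules.

```python
import csv
import io

# Assumed column names, based on the metadata fields listed above.
FIELDNAMES = ["title", "ministry", "context", "pdf_url"]


def write_records(records, stream):
    """Write scraped records (dicts) as CSV rows to an open text stream.

    Runs of whitespace inside each value are collapsed to a single space,
    and keys that are not in FIELDNAMES are ignored.
    """
    writer = csv.DictWriter(stream, fieldnames=FIELDNAMES, extrasaction="ignore")
    writer.writeheader()
    for record in records:
        cleaned = {
            key: " ".join(str(record.get(key, "")).split())
            for key in FIELDNAMES
        }
        writer.writerow(cleaned)


if __name__ == "__main__":
    # Hypothetical record, only to show the shape of the output.
    sample = [{
        "title": "  Sample   regulation title ",
        "ministry": "Sample Ministry",
        "context": "Sample context",
        "pdf_url": "https://example.com/sample.pdf",
    }]
    buffer = io.StringIO()
    write_records(sample, buffer)
    print(buffer.getvalue())
```

When writing to a real file, opening it with `open(path, "w", newline="", encoding="utf-8")` avoids blank lines between rows on Windows.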
Key Features
- Automated Pagination: Scripts that walk through multiple pages of the website to collect larger datasets.
- Multi-threading Support: alpha.py uses threads to speed up data extraction.
- Structured CSV Output: Automatically cleans and formats scraped data into ready-to-use CSV files.
- Modular Scripts: Separate scripts (scrap.py, nazo.py, nawa.py) cover different data depths and column sets.
- Metadata Extraction: Extracts specific data points such as Title, Ministry, Context, and File URLs.
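The pagination and multi-threading features can be sketched together as follows. This is a minimal example under stated assumptions, not the code in alpha.py: the `page` query parameter, the `example.com` listing URL, and the helper names `page_urls`, `fetch`, and `fetch_all` are all hypothetical, and the real site may paginate differently.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode
from urllib.request import urlopen


def page_urls(base_url, first_page, last_page, param="page"):
    """Build one listing URL per page number (inclusive range).

    Assumes the site paginates through a query parameter.
    """
    return [
        f"{base_url}?{urlencode({param: number})}"
        for number in range(first_page, last_page + 1)
    ]


def fetch(url, timeout=30):
    """Download one page and return its body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_all(urls, fetcher=fetch, max_workers=4):
    """Fetch several pages concurrently with a thread pool.

    Executor.map returns results in the same order as the input URLs,
    so page N's HTML stays aligned with page N's URL.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetcher, urls))


if __name__ == "__main__":
    # Hypothetical listing URL; replace with the real listing page.
    urls = page_urls("https://example.com/listing", 1, 3)
    # A stand-in fetcher keeps this demo offline.
    pages = fetch_all(urls, fetcher=lambda url: f"<html>{url}</html>")
    for url, html in zip(urls, pages):
        print(url, len(html))
```

Passing the fetcher as a parameter makes the concurrency logic easy to try without network access, and a small `max_workers` value keeps the request rate against the server modest.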