This Python script scrapes crime data from NDTV news articles for a given location and saves it to a CSV file. It uses the `requests` library to fetch the web pages and `BeautifulSoup` to parse the HTML, and it categorizes each crime based on keywords found in the article title and description.
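The core flow might look roughly like the sketch below. The search URL, the `news_Itm` class name, and the keyword lists are illustrative assumptions, not the script's actual values.

```python
# Minimal sketch of the fetch/parse/categorize flow; selectors and URL are placeholders.
import requests
from bs4 import BeautifulSoup

CRIME_KEYWORDS = {
    "theft": ["theft", "stolen", "robbery", "burglary"],
    "assault": ["assault", "attacked", "beaten"],
    "murder": ["murder", "killed", "homicide"],
}

def categorize(text):
    """Return the first crime type whose keywords appear in the text."""
    text = text.lower()
    for crime_type, keywords in CRIME_KEYWORDS.items():
        if any(word in text for word in keywords):
            return crime_type
    return "other"

def fetch_articles(location):
    # Hypothetical search URL; the actual script may target a different endpoint.
    url = f"https://www.ndtv.com/search?searchtext={location}+crime"
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    articles = []
    # Tag and class names below are placeholders for the real NDTV markup.
    for item in soup.find_all("div", class_="news_Itm"):
        title = item.find("h2")
        desc = item.find("p")
        if title and desc:
            title_text = title.get_text(strip=True)
            desc_text = desc.get_text(strip=True)
            articles.append({
                "title": title_text,
                "description": desc_text,
                "crime_type": categorize(f"{title_text} {desc_text}"),
            })
    return articles
```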
- Python 3
- `requests`
- `beautifulsoup4`
You can install the required dependencies by running the following command:
pip install requests beautifulsoup4
- Run the Python script `crime_data_scraper.py`.
- Enter the location and state when prompted.
- The script scrapes the crime data for the given location and saves it to a CSV file.
- The output file is named `<location>_crime_data.csv` and contains the columns location, time, crime type, description, state, and month.
python crime_data_scraper.py
Enter the location: delhi
Enter the state: delhi
Crime data has been saved to delhi_crime_data.csv.
This will generate a `delhi_crime_data.csv` file containing the scraped crime data.
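The CSV output described above could be produced roughly as in the sketch below, assuming the scraped rows are collected as dictionaries; the exact field keys used by the script are assumptions based on the columns listed earlier.

```python
# Minimal sketch of writing the scraped rows to <location>_crime_data.csv.
import csv

FIELDNAMES = ["location", "time", "crime_type", "description", "state", "month"]

def save_to_csv(rows, location):
    filename = f"{location}_crime_data.csv"
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)
    print(f"Crime data has been saved to {filename}.")
```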
- The script currently relies on specific keywords to categorize crime types, which may lead to inaccuracies or misclassifications.
- The script only scrapes crime news from the NDTV website, which may not cover all crime incidents in a location.
- The script may have difficulty handling non-English crime news or special characters.
- Improve the categorization method with machine learning techniques, such as natural language processing, to better understand the context of each article (see the sketch after this list).
- Expand the list of sources beyond NDTV to gather more comprehensive crime data.
- Add support for non-English news and handle special characters properly.
- Include additional metadata in the output, such as the URL of the news article, to provide more context.
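As a rough illustration of the first improvement, the keyword matching could be swapped for a simple text classifier. The sketch below uses scikit-learn and assumes a labeled set of headlines exists, which the current script does not provide.

```python
# Rough sketch: replace keyword matching with a TF-IDF + logistic regression classifier.
# Assumes a hypothetical labeled dataset of (headline, crime_type) pairs.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_classifier(texts, labels):
    """Fit a TF-IDF + logistic regression pipeline on labeled headlines."""
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    model.fit(texts, labels)
    return model

# Usage (with a hypothetical labeled dataset):
# model = train_classifier(train_headlines, train_labels)
# crime_type = model.predict(["Two held for chain snatching in Delhi"])[0]
```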