A basic Python web crawler

A web crawler, also known as a spider or bot, is a computer program that automatically browses through the pages of a website and collects the data it needs. Googlebot and Bingbot are two popular web crawlers used by the Google and Bing search engines respectively. These crawlers scan a webpage, collect its content and index it. They then follow any hyperlinks on that page to move on to the next page, and the process repeats. There are different ways a website author can tell a crawler not to crawl a particular page. One such method is to add the rel="nofollow" attribute to the HTML anchor tag <a>.

Here is a basic web crawler program written in Python that crawls a website to find any broken links.

Program Logic

This program requires three modules: sys, requests and lxml. The sys module gives the program access to the command-line arguments, the requests module provides the capability to send HTTP requests, and the lxml module parses HTML documents.

The start page from which the crawling begins is passed as an argument to the program. Include the http:// or https:// scheme, since the program treats the argument as an absolute URL. For example, to crawl the website www.mallukitchen.com you run the command:

python WebCrawl.py https://www.mallukitchen.com

where WebCrawl.py is the name of your Python program.

A list variable called site_links stores all the hyperlinks that are discovered. When execution begins, this list contains just the base URL.

Each URL is checked to determine whether it is an absolute link, a relative link (i.e., a location relative to the current page), or a root-relative link (i.e., a location relative to the base directory). Relative and root-relative links are converted into absolute form and then passed to the getlinks(url) function. Inside getlinks(url), an HTTP request is made to the URL. The resulting response code is checked to see whether the link is accessible or broken, and the corresponding result is printed.
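As a side note, the standard library's urllib.parse.urljoin performs the same three-way resolution, and also handles links relative to nested pages correctly. The URLs below are made up purely for illustration:

```python
from urllib.parse import urljoin

# Hypothetical current page used as the resolution base
page = "https://www.example.com/recipes/"

# Absolute links pass through unchanged
print(urljoin(page, "https://other.site/page"))
# Root-relative links resolve against the site root
print(urljoin(page, "/about.html"))
# Relative links resolve against the current page
print(urljoin(page, "curry.html"))
```

This prints https://other.site/page, https://www.example.com/about.html and https://www.example.com/recipes/curry.html respectively.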

The response content is also parsed to obtain all hyperlinks in that page. Links carrying the rel="nofollow" attribute and links that point to page sections (those containing a #) are ignored. The discovered hyperlinks are appended to the site_links list variable and the process is repeated.
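The XPath expression that does this filtering can be tried on its own against a small hand-written fragment (the links here are invented for illustration):

```python
from lxml import html

# A tiny page fragment with one normal link, one nofollow link
# and one in-page section link
snippet = """
<body>
  <a href="/recipes.html">Recipes</a>
  <a href="/ads.html" rel="nofollow">Sponsored</a>
  <a href="#top">Back to top</a>
</body>
"""
tree = html.fromstring(snippet)
links = tree.xpath("//a[not(@rel='nofollow') and not(contains(@href,'#'))]/@href")
print(links)  # ['/recipes.html']
```

Only the ordinary hyperlink survives the filter; the sponsored and section links are dropped.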

Program Source

#!/usr/bin/python

from lxml import html
import requests
import sys

# Base URL from where the crawl begins
base = sys.argv[1]

# Function to print a link's status and return the links found on that page
def getlinks(url):
    try:
        response = requests.get(url)
    except requests.RequestException:
        # Connection errors, timeouts and invalid URLs count as broken
        print('[BROKEN]', url)
        return []
    if response.status_code == 200:
        print('[OK]', url)
    else:
        print('[BROKEN]', url)
        return []
    pagehtml = html.fromstring(response.content)
    # Skip nofollow links and in-page section links (those containing a #)
    page_links = pagehtml.xpath("//a[not(@rel='nofollow') and not(contains(@href,'#'))]/@href")
    return page_links

site_links = [base]

# Repeat for each link discovered
for item in site_links:
    # Absolute link
    if item.startswith('http://') or item.startswith('https://'):
        url = item
    # Root-relative link
    elif item.startswith('/'):
        url = base + item
    # Relative link
    else:
        url = base + "/" + item

    page_links = getlinks(url)
    for link in page_links:
        if link not in site_links:
            site_links.append(link)

