Web crawler in Node JS and Mongo DB

In this article, we are going to create a web crawler using Node JS and Mongo DB. It will take a URL as an input and fetch all the anchor tags, headings, and paragraphs. You can add more features to it if you want.

Requirements

Make sure you have the following installed on your system:

  1. Node JS
  2. Mongo DB
  3. A code editor (e.g. Sublime Text)

Setup the Project

First, create an empty folder anywhere on your system and create a file named server.js inside it. Then open a terminal (CMD) and navigate to that folder with the following command:

cd "path_of_your_folder"

We are going to need multiple modules for this web crawler, so install them with the following command:

npm install express http ejs socket.io request cheerio express-formidable mongodb htmlspecialchars node-html-parser

Here is why each of these modules is needed:

  1. express is the framework we use for routing.
  2. http creates the HTTP server (it is a built-in Node JS module).
  3. ejs is the template engine used to render HTML files.
  4. socket.io is used for real-time communication.
  5. request is used to fetch the content of a web page.
  6. cheerio provides jQuery-style DOM manipulation on the server.
  7. express-formidable parses FormData values from incoming requests.
  8. mongodb is the driver for our database.
  9. htmlspecialchars converts HTML tags into entities.
  10. node-html-parser converts an HTML string into DOM nodes.

After all the modules are installed, run the following commands to install nodemon globally and start the server (nodemon automatically restarts the server whenever you save a file):

npm install -g nodemon
nodemon server.js

Start the server

Open your server.js and write the following code in it to start the server at port 3000.

var express = require("express");
var app = express();
var http = require("http").createServer(app);
http.listen(3000, function () {
    console.log("Server started running at port: 3000");
});

Now your project will be up and running at http://localhost:3000/

Connect Node JS with Mongo DB

To connect Node JS with Mongo DB, first we need to create an instance of the Mongo DB client in our server.js. Place the following lines before the http.listen function.

var mongodb = require("mongodb");
var mongoClient = mongodb.MongoClient;
var ObjectID = mongodb.ObjectID;
var database = null;

Now write the following code inside the http.listen callback function.

mongoClient.connect("mongodb://localhost:27017", {
    useUnifiedTopology: true
}, function (error, client) {
    if (error) {
        throw error;
    }
    database = client.db("web_crawler");
    console.log("Database connected");
});

If you check your CMD now, you will see the message “Database connected”.

Crawl the web page

Now we need to create a form to get the URL as input. First, we will tell our Express app that we are using EJS as our templating engine and that all our CSS and JS files will be served from the public folder. Place the following lines before the http.listen function.

// server.js
app.set('view engine', 'ejs');
app.use("/public", express.static(__dirname + "/public"));

Now create 2 folders at the root of your project, “public” and “views”. Download the latest jQuery, Bootstrap, DataTables, and Socket IO libraries and place their files inside the public folder (a sketch of the expected public folder layout appears after the route below). Create a new file named index.ejs inside the views folder. Then create a GET route in server.js once Mongo DB is connected.

app.get("/", async function (request, result) {
    result.render("index");
});
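
For reference, this is roughly what the public folder should end up looking like, based on the file paths used later in this tutorial (exact file names depend on the versions you download):

public/
    bootstrap.css
    bootstrap.js
    jquery-3.3.1.min.js
    jquery.dataTables.min.css
    jquery.dataTables.min.js
    socket.io.js
    font-awesome-4.7.0/
        css/
            font-awesome.css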

If you access your project from the browser now, you will see an empty screen. Open your index.ejs and write the following code in it:

<link rel="stylesheet" href="/public/bootstrap.css" />
<div class="container" style="margin-top: 150px;">
    <div class="row">
        <div class="col-md-8">
            <form method="POST" onsubmit="return crawlPage(this);">
                <div class="form-group">
                    <label>Enter URL</label>
                    <input type="url" name="url" class="form-control" required />
                </div>
                <input type="submit" name="submit" value="Crawl" class="btn btn-info" />
            </form>
        </div>
    </div>
</div>
<script src="/public/jquery-3.3.1.min.js"></script>
<script src="/public/bootstrap.js"></script>
<style>
    body {
        background: linear-gradient(0deg, #00fff3, #a5a5a5);
    }
</style>

You will now see a simple form with an input field and a submit button. In that input field, you can enter the URL of the page you want to crawl. Now we need to create a Javascript function that will be called when the form is submitted. In that function, we will send an AJAX request to the Node JS server; the function returns false so the browser does not perform a normal form submission and reload the page.

<script>
    function crawlPage(form) {
        var ajax = new XMLHttpRequest();
        ajax.open("POST", "/crawl-page", true);
        ajax.onreadystatechange = function () {
            if (this.readyState == 4) {
                if (this.status == 200) {
                    // console.log(this.responseText);
                    var data = JSON.parse(this.responseText);
                    // console.log(data);
                }
            }
        };
        var formData = new FormData(form);
        ajax.send(formData);
        return false;
    }
</script>

Get the web page content

To fetch the content of the web page, first we will use express-formidable as our middleware. We will also require the modules needed to read the web page, convert its HTML into DOM nodes, and send real-time updates over Socket IO. Write the following lines before the http.listen function.

const formidableMiddleware = require('express-formidable');
app.use(formidableMiddleware());
const requestModule = require("request");
const cheerio = require('cheerio');
var htmlspecialchars = require("htmlspecialchars");
var HTMLParser = require('node-html-parser');
var io = require("socket.io")(http, {
    "cors": {
        "origin": "*"
    }
});

After that, we will create a POST route to crawl the web page.

app.post("/crawl-page", async function (request, result) {
    var url = request.fields.url;
    crawlPage(url);
    
    result.json({
        "status": "success",
        "message": "Page has been crawled",
        "url": url
    });
});

Notice that this route responds immediately; the actual crawling runs in a separate function, and progress is reported to the browser over Socket IO. Next, we will create the functions that crawl the web page and save its content in Mongo DB. Write the following functions at the top of your server.js file.

// Returns the inner HTML of every element matching querySelector inside content
function getTagContent(querySelector, content, pageUrl) {
    var tags = content.querySelectorAll(querySelector);
    var innerHTMLs = [];
    for (var a = 0; a < tags.length; a++) {
        var text = "";
        // If the tag wraps an anchor, take the anchor's content instead
        var anchorTag = tags[a].querySelector("a");
        if (anchorTag != null) {
            text = anchorTag.innerHTML;
        } else {
            text = tags[a].innerHTML;
        }
        // Collapse whitespace and skip empty values
        text = text.replace(/\s+/g, ' ').trim();
        if (text.length > 0) {
            innerHTMLs.push(text);
        }
    }
    return innerHTMLs;
}
function crawlPage(url, callBack = null) {
    var pathArray = url.split( '/' );
    var protocol = pathArray[0];
    var host = pathArray[2];
    var baseUrl = protocol + '//' + host;
    io.emit("crawl_update", "Crawling page: " + url);
    requestModule(url, async function (error, response, html) {
        if (!error && response.statusCode == 200) {
            var $ = cheerio.load(html);
            // Get text 
            // console.log("------- with request module -------")
            // console.log($.text());
            // Get HTML 
            // console.log($.html());
            var page = await database.collection("pages").findOne({
                "url": url
            });
            if (page == null) {
                var html = $.html();
                var htmlContent = HTMLParser.parse(html);
                var allAnchors = htmlContent.querySelectorAll("a");
                var anchors = [];
                for (var a = 0; a < allAnchors.length; a++) {
                    var href = allAnchors[a].getAttribute("href");
                    var title = allAnchors[a].innerHTML;
                    var hasAnyChildTag = (allAnchors[a].querySelector("div") != null)
                        || (allAnchors[a].querySelector("img") != null)
                        || (allAnchors[a].querySelector("p") != null)
                        || (allAnchors[a].querySelector("span") != null)
                        || (allAnchors[a].querySelector("svg") != null)
                        || (allAnchors[a].querySelector("strong") != null);
                    if (hasAnyChildTag) {
                        continue;
                    }
                    if (href != null) {
                        
                        if (href == "#" || href.search("javascript:void(0)") != -1) {
                            continue;
                        }
                        var first4Words = href.substr(0, 4);
                        if (href.search(url) == -1 && first4Words != "http") {
                            if (href[0] == "/") {
                                href = baseUrl + href;
                            } else {
                                href = baseUrl + "/" + href;
                            }
                        }
                        anchors.push({
                            "href": href,
                            "text": title
                        });
                    }
                }
                io.emit("crawl_update", htmlspecialchars("<a>") + " tags has been crawled");
                var titles = await getTagContent("title", htmlContent, url);
                var title = titles.length > 0 ? titles[0] : "";
                io.emit("crawl_update", htmlspecialchars("<title>") + " tag has been crawled");
                var h1s = await getTagContent("h1", htmlContent, url);
                io.emit("crawl_update", htmlspecialchars("<h1>") + " tags has been crawled");
                var h2s = await getTagContent("h2", htmlContent, url);
                io.emit("crawl_update", htmlspecialchars("<h2>") + " tags has been crawled");
                var h3s = await getTagContent("h3", htmlContent, url);
                io.emit("crawl_update", htmlspecialchars("<h3>") + " tags has been crawled");
                var h4s = await getTagContent("h4", htmlContent, url);
                io.emit("crawl_update", htmlspecialchars("<h4>") + " tags has been crawled");
                var h5s = await getTagContent("h5", htmlContent, url);
                io.emit("crawl_update", htmlspecialchars("<h5>") + " tags has been crawled");
                var h6s = await getTagContent("h6", htmlContent, url);
                io.emit("crawl_update", htmlspecialchars("<h6>") + " tags has been crawled");
                var ps = await getTagContent("p", htmlContent, url);
                io.emit("crawl_update", htmlspecialchars("<p>") + " tags has been crawled");
                var object = {
                    "url": url,
                    "anchors": anchors,
                    "title": title,
                    "h1s": h1s,
                    "h2s": h2s,
                    "h3s": h3s,
                    "h4s": h4s,
                    "h5s": h5s,
                    "h6s": h6s,
                    "ps": ps,
                    "time": new Date().getTime()
                };
                try {
                    await database.collection("pages").insertOne(object);
                } catch (e) {
                    console.log(e);
                }
                io.emit("page_crawled", object);
                io.emit("crawl_update", "Page crawled.");
            } else {
                io.emit("crawl_update", "Page already crawled.");
            }
            if (callBack != null) {
                callBack();
            }
        }
    });
}
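
If you want to exercise the crawl route directly from the terminal instead of using the form, a request like the following should work while the server is running (the URL here is just an example; express-formidable reads the multipart form field into request.fields.url):

curl -X POST http://localhost:3000/crawl-page -F "url=https://example.com"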

If you refresh the page now, enter the URL of any web page, and hit enter, you will see its content stored in the Mongo DB database named web_crawler. To inspect the data, you can download a tool called Mongo DB Compass.
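
Alternatively, a quick check from the Mongo shell (mongosh) would look something like this, assuming the default local connection used above:

use web_crawler
db.pages.find({}, { "url": 1, "title": 1, "time": 1 }).pretty()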

Show data in DataTable

Now, whenever a new web page is crawled, we will display it in a table. We will be using a library called DataTables. We will also include the Socket IO client library for real-time communication. So include those files in your index.ejs:

<link rel="stylesheet" href="/public/jquery.dataTables.min.css" />
<script src="/public/socket.io.js"></script>
<script src="/public/jquery.dataTables.min.js"></script>

Then we will create a row with 2 columns. In the left column, we will create a table to display all crawled pages. In the right column, we will display all crawl updates, e.g. “headings have been crawled”, “paragraphs have been crawled”, etc.

<div class="row">
    <div class="col-md-8">
        <table class="table table-bordered" id="my-table">
            <thead>
                <tr>
                    <th>URL</th>
                    <th>Title</th>
                    <th>Time</th>
                </tr>
            </thead>
            <tbody id="data"></tbody>
        </table>
    </div>
    <div class="col-md-4">
        <ul class="list-group" id="my-updates"></ul>
    </div>
</div>

Just to make it look better, you can apply the following styles in CSS.

#my-updates {
    max-height: 300px;
    overflow-y: scroll;
    width: fit-content;
}
.table-bordered th, .table-bordered td,
.dataTables_wrapper .dataTables_filter input {
    border: 1px solid black !important;
}
.table thead th {
    border-bottom: 2px solid black !important;
}

Then we need to initialize the DataTables library and attach event listeners for crawl updates. Crawl updates will be prepended to the <ul> list, and completed crawled web pages will be appended to the data table.

var table = null;
var socketIO = io("http://localhost:3000/");
var months = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"];
window.addEventListener("load", function () {
    table = $('#my-table').DataTable({
        "order": [[ 2, "asc" ]]
    });
});
socketIO.on("crawl_update", function (data) {
    // console.log(data);
    var html = "";
    html += `<li class="list-group-item">` + data + `</li>`;
    document.getElementById("my-updates").innerHTML = html + document.getElementById("my-updates").innerHTML;
    document.getElementById('my-updates').scrollTop = 0;
});
socketIO.on("page_crawled", function (data) {
    // console.log(data);
    var date = new Date(data.time);
    // getMonth() is zero-based, so it maps directly onto the months array
    var time = date.getDate() + " " + months[date.getMonth()] + ", " + date.getFullYear() + " - " + date.getHours() + ":" + date.getMinutes() + ":" + date.getSeconds();
    table.row.add( [
        "<a href='/page/" + encodeURIComponent(data.url) + "'>" + data.url + "</a>",
        data.title,
        time
    ] ).draw( false );
});

Now you will see the data in the table when you crawl some page. You can crawl as many pages as you want.

Fetch data from Mongo DB

At this point, data in the data table is only displayed when you crawl some page. But when you reload the page, the data table will be empty. However, the data is still stored in the database. Our web crawler has saved all the crawled pages in a Mongo DB collection named “pages”. So we need to populate the previously saved pages from the database in the data table when the page loads.

First, change our “/” GET route in the server.js to the following:

app.get("/", async function (request, result) {
    var months = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"];
    var pages = await database.collection("pages").find({})
        .sort({
            "time": -1
        }).toArray();
    for (var index in pages) {
        var date = new Date(pages[index].time);
        // getMonth() is zero-based, so no offset is needed
        var time = date.getDate() + " " + months[date.getMonth()] + ", " + date.getFullYear() + " - " + date.getHours() + ":" + date.getMinutes() + ":" + date.getSeconds();
        
        pages[index].time = time;
    }            
    result.render("index", {
        "pages": pages
    });
});

And in our index.ejs inside the <tbody> tag, we will display all the pages.

<tbody id="data">
    <% for (var index in pages) { %>
        <tr>
            <td>
                <a href="/page/<%= encodeURIComponent(pages[index].url) %>">
                    <%= pages[index].url %>
                </a>
            </td>
            <td><%= pages[index].title %></td>
            <td><%= pages[index].time %></td>
        </tr>
    <% } %>
</tbody>

If you refresh the page now, you will see all pages in the data table. You will only see the URL, title, and the time when the page was last crawled. But we also need the anchor tags, headings, and paragraphs on each page.

Show page content

Click on any of the links in the data table and it will take you to an error page. We need to turn that error page into a detail page. Create a GET route in server.js that will fetch the page from the database and render it in an EJS template.

app.get("/page/:url", async function (request, result) {
    var url = request.params.url;
    var page = await database.collection("pages").findOne({
        "url": url
    });
    if (page == null) {
        result.render("404", {
            "message": "This page has not been crawled"
        });
        return false;
    }
    result.render("page", {
        "page": page
    });
});

In your views folder, create a file named 404.ejs that will be displayed when the URL has not been crawled yet.

<!-- 404.ejs -->
<link rel="stylesheet" href="/public/bootstrap.css" />
<div class="jumbotron">
    <h1 class="display-4">404 - Not Found</h1>
    <p class="lead"><%= message %></p>
</div>
<script src="/public/jquery-3.3.1.min.js"></script>
<script src="/public/bootstrap.js"></script>

Now create a file named page.ejs inside the views folder. Inside this file, we will show all the crawled tags in separate data tables.

<link rel="stylesheet" href="/public/bootstrap.css" />
<link rel="stylesheet" href="/public/font-awesome-4.7.0/css/font-awesome.css" />
<link rel="stylesheet" href="/public/jquery.dataTables.min.css" />
<div class="container" style="margin-top: 50px;">
    <div class="jumbotron">
        <h1><%= page.title %></h1>
        <div class="row">
            <div class="col-md-1">
                <form method="POST" action="/delete-page" onsubmit="return confirm('Are you sure you want to delete this page ?');">
                    <input type="hidden" name="url" value="<%= page.url %>" required />
                    <input type="submit" class="btn btn-danger" value="Delete" />
                </form>
            </div>
            <div class="col-md-1">
                <form method="POST" action="/reindex" onsubmit="return confirm('Are you sure you want to re-index this page ?');">
                    <input type="hidden" name="url" value="<%= page.url %>" required />
                    <input type="submit" class="btn btn-primary" value="Re-index" />
                </form>
            </div>
        </div>
    </div>
    <div class="row">
        <div class="col-md-12">
            <table class="table table-bordered my-table">
                <thead>
                    <tr>
                        <th>Anchors</th>
                    </tr>
                </thead>
                <tbody>
                    <% for (var index in page.anchors) { %>
                        <tr>
                            <td>
                                <a href="<%= page.anchors[index].href %>">
                                    <%= page.anchors[index].text %>
                                </a>
                            </td>
                        </tr>
                    <% } %>
                </tbody>
            </table>
        </div>
    </div>
    <div class="row">
        <div class="col-md-12">
            <table class="table table-bordered my-table">
                <thead>
                    <tr>
                        <th>H1</th>
                    </tr>
                </thead>
                <tbody>
                    <% for (var index in page.h1s) { %>
                        <tr>
                            <td>
                                <%= page.h1s[index] %>
                            </td>
                        </tr>
                    <% } %>
                </tbody>
            </table>
        </div>
    </div>
    <div class="row">
        <div class="col-md-12">
            <table class="table table-bordered my-table">
                <thead>
                    <tr>
                        <th>H2</th>
                    </tr>
                </thead>
                <tbody>
                    <% for (var index in page.h2s) { %>
                        <tr>
                            <td>
                                <%= page.h2s[index] %>
                            </td>
                        </tr>
                    <% } %>
                </tbody>
            </table>
        </div>
    </div>
    <div class="row">
        <div class="col-md-12">
            <table class="table table-bordered my-table">
                <thead>
                    <tr>
                        <th>H3</th>
                    </tr>
                </thead>
                <tbody>
                    <% for (var index in page.h3s) { %>
                        <tr>
                            <td>
                                <%= page.h3s[index] %>
                            </td>
                        </tr>
                    <% } %>
                </tbody>
            </table>
        </div>
    </div>
    <div class="row">
        <div class="col-md-12">
            <table class="table table-bordered my-table">
                <thead>
                    <tr>
                        <th>H4</th>
                    </tr>
                </thead>
                <tbody>
                    <% for (var index in page.h4s) { %>
                        <tr>
                            <td>
                                <%= page.h4s[index] %>
                            </td>
                        </tr>
                    <% } %>
                </tbody>
            </table>
        </div>
    </div>
    <div class="row">
        <div class="col-md-12">
            <table class="table table-bordered my-table">
                <thead>
                    <tr>
                        <th>H5</th>
                    </tr>
                </thead>
                <tbody>
                    <% for (var index in page.h5s) { %>
                        <tr>
                            <td>
                                <%= page.h5s[index] %>
                            </td>
                        </tr>
                    <% } %>
                </tbody>
            </table>
        </div>
    </div>
    <div class="row">
        <div class="col-md-12">
            <table class="table table-bordered my-table">
                <thead>
                    <tr>
                        <th>H6</th>
                    </tr>
                </thead>
                <tbody>
                    <% for (var index in page.h6s) { %>
                        <tr>
                            <td>
                                <%= page.h6s[index] %>
                            </td>
                        </tr>
                    <% } %>
                </tbody>
            </table>
        </div>
    </div>
    <div class="row">
        <div class="col-md-12">
            <table class="table table-bordered my-table">
                <thead>
                    <tr>
                        <th>P</th>
                    </tr>
                </thead>
                <tbody>
                    <% for (var index in page.ps) { %>
                        <tr>
                            <td>
                                <%= page.ps[index] %>
                            </td>
                        </tr>
                    <% } %>
                </tbody>
            </table>
        </div>
    </div>
</div>
<script>
    window.addEventListener("load", function () {
        $('.my-table').DataTable();
    });
</script>
<style>
    .row {
        margin-top: 50px;
    }
    .table-bordered th, .table-bordered td,
    .dataTables_wrapper .dataTables_filter input {
        border: 1px solid black !important;
    }
    .table thead th {
        border-bottom: 2px solid black !important;
    }
    body {
        background: linear-gradient(0deg, #00fff3, #a5a5a5);
    }
</style>
<script src="/public/jquery-3.3.1.min.js"></script>
<script src="/public/bootstrap.js"></script>
<script src="/public/jquery.dataTables.min.js"></script>

Along with all the data of the web page, this view also shows 2 buttons: “Delete” and “Re-index”. Delete simply removes the page from the database. Re-index means re-crawling the web page to fetch its updated content. First, we will create a POST route for deleting the page in our server.js file.

app.post("/delete-page", async function (request, result) {
    var url = request.fields.url;
    await database.collection("pages").deleteOne({
        "url": url
    });
    io.emit("page_deleted", url);
    var backURL = request.header('Referer') || '/';
    result.redirect(backURL);
});

And in our index.ejs we will attach an event listener that will be called when the page is deleted. In that function, we will simply remove that row from the data table.

Remove a specific row from the DataTable:

socketIO.on("page_deleted", function (url) {
    table
        .rows( function ( idx, data, node ) {
            return data[0].includes(url);
        } )
        .remove()
        .draw();
});

This searches for every row whose URL column contains the deleted URL and removes it. The draw() call then re-renders the data table.

Re-index the page

Now we need to add the ability to re-index a page, which means fetching its updated content. As we did for delete, we use a form for re-indexing (this form was already included in page.ejs above):

<div class="col-md-1">
    <form method="POST" action="/reindex" onsubmit="return confirm('Are you sure you want to re-index this page ?');">
        <input type="hidden" name="url" value="<%= page.url %>" required />
        <input type="submit" class="btn btn-primary" value="Re-index" />
    </form>
</div>

This shows a “Re-index” button alongside the Delete button. Then we need to create a POST route in our server.js:

app.post("/reindex", async function (request, result) {
    var url = request.fields.url;
    await database.collection("pages").deleteOne({
        "url": url
    });
    io.emit("page_deleted", url);
    crawlPage(url, function () {
        var backURL = request.header('Referer') || '/';
        result.redirect(backURL);
    });
});

That’s it. If you run the code now, you will be able to re-index the page and update your database. You can try this feature by first crawling a URL today. Then try to crawl the same URL again a few days or weeks later.

So that’s how you can create a simple web crawler in Node JS and Mongo DB. You can also check our financial ledger tutorial to learn how to build a financial ledger in Node JS and Mongo DB.