Google One Tap sign-in allows your website users to quickly log in to your site without having to fill in a lot of form fields.
First, go to the Google Developer Console and select or create a project.
After a project is selected, you need to create credentials for it.
In the credentials screen, you need to create OAuth Client ID credentials.
After that, you can specify:
The type of application as “Web Application”.
The name of the credential, which can be anything you like.
Your website domain in “Authorized Javascript origins”.
Then, go to the OAuth Consent Screen.
The OAuth Consent Screen shows a form where you can enter the text that will be displayed to the user in the login prompt.
Once everything is done, click “Save and Continue”. It will show you your Client ID and Client Secret.
Install Google OAuth Library
Copy both the Google Client ID and Client Secret values into PHP variables:
<?php
// use sessions, to show the login prompt only if the user is not logged-in
session_start();
// paste google client ID and client secret keys
$google_oauth_client_id = "";
$google_oauth_client_secret = "";
?>
We will be using PHP sessions because we will show this prompt only if the user is not logged in.
Make sure you have Composer installed. You can download and install Composer from here.
After that, you need to run the following command at your root folder:
composer require google/apiclient
Display Google One-tap Sign In Prompt
Paste the following lines in the file where you want to show the one-tap sign-in prompt. If you want to show it to all the pages of your website, simply paste it into your header or footer file.
<!-- check if the user is not logged in -->
<?php if (!isset($_SESSION["user"])): ?>
<!-- display the login prompt -->
<script src="https://accounts.google.com/gsi/client" async defer></script>
<div id="g_id_onload"
data-client_id="<?php echo $google_oauth_client_id; ?>"
data-context="signin"
data-callback="googleLoginEndpoint"
data-close_on_tap_outside="false">
</div>
<?php endif; ?>
This makes sure the prompt is shown only if the user is not logged in. data-callback is the Javascript function that will be called when the user taps the login button.
<script>
// callback function that will be called when the user is successfully logged-in with Google
function googleLoginEndpoint(googleUser) {
// get user information from Google
console.log(googleUser);
// send an AJAX request to register the user in your website
var ajax = new XMLHttpRequest();
// path of server file
ajax.open("POST", "google-sign-in.php", true);
// callback when the status of AJAX is changed
ajax.onreadystatechange = function () {
// when the request is completed
if (this.readyState == 4) {
// when the response is okay
if (this.status == 200) {
console.log(this.responseText);
}
// if there is any server error
if (this.status == 500) {
console.log(this.responseText);
}
}
};
// send google credentials in the AJAX request
var formData = new FormData();
formData.append("id_token", googleUser.credential);
ajax.send(formData);
}
</script>
We are sending an AJAX request to our server so we can verify the user's Google credential token. The verification must happen on the server, because anyone can tamper with client-side variables.
Verify Google Credentials Token – PHP
We are creating a file named “google-sign-in.php” where we will do this verification.
<?php
// use sessions
session_start();
// include google API client
require_once "vendor/autoload.php";
// set google client ID
$google_oauth_client_id = "";
// create google client object with client ID
$client = new Google_Client([
'client_id' => $google_oauth_client_id
]);
// verify the token sent from AJAX
$id_token = $_POST["id_token"];
$payload = $client->verifyIdToken($id_token);
if ($payload && $payload['aud'] == $google_oauth_client_id)
{
// get user information from Google
$user_google_id = $payload['sub'];
$name = $payload["name"];
$email = $payload["email"];
$picture = $payload["picture"];
// login the user
$_SESSION["user"] = $user_google_id;
// send the response back to client side
echo "Successfully logged in. " . $user_google_id . ", " . $name . ", " . $email . ", " . $picture;
}
else
{
// token is not verified or expired
echo "Failed to login.";
}
?>
Here, you need to place your Google Client ID again. This code verifies the Google credential token. It also starts the user session, so the next time the user refreshes the page, the login prompt will not be shown.
You can learn about saving the data in the database from here.
Sockets are used for real-time communication. They are now being used in chat apps, team collaboration tools, and many other applications. Socket.IO emits events to receivers, which are constantly listening for those events. When an event is received on the client side, the client can perform the necessary action. You can attach as many event listeners as you want and perform a different action for each event.
Users connect to a Node JS server using a client-side library called Socket IO. Users can also join rooms, which is helpful if you are creating a group chat app. There are 4 ways in which socket events can be fired:
Send event to all connected users, including the sender.
Send event to all users, except the sender.
Emit event to all users in a room.
Send event to specific users.
In this tutorial, we will be covering the 4th way, i.e. sending socket events to specific users.
Problem
Suppose you have a chat app where 2 people have a private conversation, and you want a real-time effect, i.e. new messages appear without refreshing the page. This requires sockets, which send the data in real time so we can also show it on the client side in real time. When a sender sends a message to a specific user, we need to send the socket event to that specific user only.
Solution
We will create a simple script that allows us to send events to a specific user only. You can then integrate and customize that logic in your project. First, you need to download and install Node JS. You also need to download the Socket IO JS client-side library. We will have a simple database from where we can show all the users in a list, with a button to send an event to that user only. So we need to create a database with a simple users table, you can use your own database as well.
Database
In your phpMyAdmin, create a database named “send_socket_event_to_specific_users”. In that database, create a users table:
CREATE TABLE `users` (
`id` int(11) NOT NULL PRIMARY KEY AUTO_INCREMENT,
`name` text NOT NULL
);
Add a few rows to that table so we can show them in a list or a table, for example:
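These seed rows are just placeholders; any names will do:

INSERT INTO `users` (`name`) VALUES ('Adnan'), ('John'), ('Jane');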
The page will show all users in a table, each with a button to send a message; a minimal version of that markup is sketched below. When the page loads, we also need the ID of the current user; you can get it from PHP sessions as well.
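The index.php markup was not included here; a minimal sketch, assuming the users are fetched with PDO and each row carries a hidden id field read by the sendEvent function shown later:

<?php
// fetch all users to display in the table
$conn = new PDO("mysql:host=localhost;dbname=send_socket_event_to_specific_users", "root", "");
$users = $conn->query("SELECT * FROM users")->fetchAll();
?>
<table>
    <?php foreach ($users as $user): ?>
        <tr>
            <td><?php echo $user["name"]; ?></td>
            <td>
                <form onsubmit="return sendEvent(this);">
                    <input type="hidden" name="id" value="<?php echo $user['id']; ?>" />
                    <input type="submit" value="Send message" />
                </form>
            </td>
        </tr>
    <?php endforeach; ?>
</table>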
Include Socket IO library
Before that, we need to include the Socket IO JS library. You can download it from here.
<script src="socket.io.js"></script>
<script>
var userId = prompt("Enter user ID");
var socketIO = io("http://localhost:3000");
socketIO.emit("connected", userId);
</script>
This stores your ID in the userId variable, connects to the Node JS server, and emits a “connected” event with your ID.
Now we need to create a simple Node server. Create an empty folder and create a file named “server.js” in it. Then open CMD in that folder, run the commands below one-by-one to install the dependencies, and write the server code that follows into server.js.
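The exact commands were omitted here; for the modules used below (express, socket.io, and later mysql), they would typically be:

npm init -y
npm install express socket.io mysql

Then, in server.js: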
var express = require("express");
var app = express();
var http = require("http").createServer(app);
var socketIO = require("socket.io")(http, {
cors: {
origin: "*"
}
});
// map of user ID -> socket ID
var users = {};
socketIO.on("connection", function (socket) {
socket.on("connected", function (userId) {
users[userId] = socket.id;
});
// socket.on("sendEvent") goes here
});
http.listen(process.env.PORT || 3000, function () {
console.log("Server is started.");
});
This starts the server on port 3000, creates a users map, and stores each connected user's socket ID in it.
Send event using socket IO emit function
Back in index.php, we need to create a JS function that sends the event when the user clicks the “Send message” button:
function sendEvent(form) {
    // ask for the message to send
    var message = prompt("Enter message");
    // emit the event with the sender's ID, the receiver's ID, and the message
    socketIO.emit("sendEvent", {
        "myId": userId,
        "userId": form.id.value,
        "message": message
    });
    // prevent the default form submission
    return false;
}
Now in server.js, we need to listen for that event and send the message to that user only. But before that, we need to include the mysql module, because the users' names are stored in the MySQL database. At the top of your server.js:
var mysql = require("mysql");
var connection = mysql.createConnection({
host: "localhost",
port: 3306,
user: "root",
password: "",
database: "send_socket_event_to_specific_users"
});
connection.connect(function (error) {
    if (error) {
        console.error("Database connection failed: " + error);
        return;
    }
    console.log("Database connected.");
});
And after the socket connected event:
socket.on("sendEvent", async function (data) {
connection.query("SELECT * FROM users WHERE id = " + data.userId, function (error, receiver) {
if (receiver != null) {
if (receiver.length > 0) {
connection.query("SELECT * FROM users WHERE id = " + data.myId, function (error, sender) {
if (sender.length > 0) {
var message = "New message received from: " + sender[0].name + ". Message: " + data.message;
socketIO.to(users[receiver[0].id]).emit("messageReceived", message);
}
});
}
}
});
});
This will search the sender and receiver by ID, and emit the event to the receiver with the name of the sender.
Listen to socket IO events
Now we need to listen to that event in our index.php and show a message in a list when that event is received. First, create a ul where all messages will be displayed:
<ul id="messages"></ul>
Then attach that event in JS:
socketIO.on("messageReceived", function (data) {
var html = "<li>" + data + "</li>";
document.getElementById("messages").innerHTML = html + document.getElementById("messages").innerHTML;
});
So that’s how you can use the socket IO emit function to send the event to a specific user only.
Check out realtime chat app tutorial using socket IO.
In this tutorial, we are going to show you how you can display a SweetAlert confirmation dialog when submitting a form. For example, suppose you have a form that, when submitted, deletes data from the database. In that case, you must confirm with the user, because they might have clicked the button by accident. So you can show a nice dialog using the SweetAlert library. Suppose you have the following form:
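The form markup was not included here; a minimal sketch, assuming a form that calls submitForm on submit (the action and fields are placeholders):

<form method="POST" action="delete.php" onsubmit="return submitForm(this);">
    <input type="hidden" name="id" value="1" />
    <input type="submit" value="Delete" />
</form>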
When the form submits, we call a Javascript function submitForm, passing the form as a parameter. Next, you need to download the SweetAlert library from here. After downloading, paste it into your project and include it in your HTML file:
<script src="sweetalert.min.js"></script>
Now, we can create that Javascript function that will ask for confirmation. Once confirmed, it will submit the form.
<script>
function submitForm(form) {
swal({
title: "Are you sure?",
text: "This form will be submitted",
icon: "warning",
buttons: true,
dangerMode: true,
})
.then(function (isOkay) {
if (isOkay) {
form.submit();
}
});
return false;
}
</script>
At this point, if you submit the form, you will see a SweetAlert confirmation dialog first, and all the form fields will still be submitted correctly to the server. You can check this by printing out all the values received from the form:
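For example, a quick debug line on the server side (assuming PHP, as in the rest of this tutorial):

<?php
// dump everything received from the form
print_r($_POST);
?>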
In this tutorial, we are going to show you how you can email a download link to a user when they request to download a file from your website. We are going to create a system where you can upload files and users can download them; but before downloading, they must enter their email address, and the link to download the file is then emailed to them. This way, you can collect a large number of email addresses and build a large email list.
Prevent direct access to files
First, we are going to prevent users from downloading the files directly from the URL. We will be storing all our uploaded files in the “uploads” folder. So, create a new folder named “uploads” at the root of your project. Create a “.htaccess” file in this folder. In this file, write the following single line:
deny from all
This will give a “403 Forbidden” error whenever someone tries to access a file directly from the browser.
Upload files
Now we are going to create a form that will allow you to upload files (a sketch is shown below). You can create this form in your admin panel, because usually only the administrator of the website uploads files.
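The upload form markup was not included here; a minimal sketch, matching the upload and file fields used by the PHP code below:

<form method="POST" action="" enctype="multipart/form-data">
    <input type="file" name="file" required />
    <input type="submit" name="upload" value="Upload" />
</form>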
Create a database named “collect_emails_while_downloading_files” in your phpMyAdmin. Or you can use your own database if you already have one. In this database, you need to create a table where we will store the path and name of all uploaded files.
CREATE TABLE files (
id INTEGER(11) PRIMARY KEY AUTO_INCREMENT NOT NULL,
file_name TEXT NOT NULL,
file_path TEXT NOT NULL
);
Then we need to save the selected file in the uploads folder and its path in the files table in the MySQL database. We will be using PHP PDO prepared statements, which help prevent SQL injection.
<?php
// connect with database
$conn = new PDO("mysql:host=localhost:8889;dbname=collect_emails_while_downloading_files", "root", "root");
// check if form is submitted, for admin panel only
if (isset($_POST["upload"]))
{
// get the file
$file = $_FILES["file"];
// make sure it does not have any error
if ($file["error"] == 0)
{
// save file in uploads folder
$file_path = "uploads/" . $file["name"];
move_uploaded_file($file["tmp_name"], $file_path);
// save file path in database, prevent SQL injection too
$sql = "INSERT INTO files(file_name, file_path) VALUES (:file_name, :file_path)";
$result = $conn->prepare($sql);
$result->execute([
":file_name" => $file["name"],
":file_path" => $file_path
]);
}
else
{
die("Error uploading file.");
}
}
// get all files
$sql = "SELECT * FROM files ORDER BY id DESC";
$result = $conn->query($sql);
$files = $result->fetchAll();
?>
Refresh the page and try uploading a file. You will see that it is saved in your uploads folder and that its path and name are stored in the files table. Also, try accessing the file directly from the browser; it will give you a 403 Forbidden error.
Show all uploaded files
In the previous step, we ran a query to fetch all files, sorted from latest to oldest. Now we need to show them in a table (a sketch is shown below).
The table shows each file with a download button. When that button is clicked, we need to get the user's email address so we can email them a download link for that file. That link will be valid for that file and that email only.
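The table markup was not included here; a minimal sketch, assuming each row renders a form whose hidden id and email fields are read by check-email.php and filled by the onFormSubmit function below:

<table>
    <?php foreach ($files as $file): ?>
        <tr>
            <td><?php echo $file["file_name"]; ?></td>
            <td>
                <form method="POST" action="check-email.php" onsubmit="return onFormSubmit(this);">
                    <input type="hidden" name="id" value="<?php echo $file['id']; ?>" />
                    <input type="hidden" name="email" />
                    <input type="submit" value="Download" />
                </form>
            </td>
        </tr>
    <?php endforeach; ?>
</table>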
<script>
function onFormSubmit(form) {
// get email address and submit
var email = prompt("Enter your email:", "");
if (email != null && email != "") {
form.email.value = email;
return true;
}
return false;
}
</script>
Send download link in email
We will be using the PHPMailer library to send emails. Open CMD at the root folder of your project and run the following command (make sure you have Composer downloaded and installed on your system):
composer require phpmailer/phpmailer
Create a table in your database that will store all the download requests of files sent by users. Run the following query in your database in phpMyAdmin:
CREATE TABLE download_requests (
id INTEGER(11) PRIMARY KEY AUTO_INCREMENT NOT NULL,
file_id INTEGER(11) NOT NULL,
email TEXT NOT NULL,
token TEXT NOT NULL,
CONSTRAINT fk_file_id FOREIGN KEY (file_id) REFERENCES files (id) ON DELETE CASCADE ON UPDATE CASCADE
);
Create a file named “check-email.php” and write the following code in it. It will send the email to the user and also store the request in the table created above.
<?php
// composer require phpmailer/phpmailer
// include PHPMailer library
use PHPMailer\PHPMailer\PHPMailer;
use PHPMailer\PHPMailer\SMTP;
use PHPMailer\PHPMailer\Exception;
require 'vendor/autoload.php';
// connect with database
$conn = new PDO("mysql:host=localhost:8889;dbname=collect_emails_while_downloading_files", "root", "root");
// get all form values
$id = $_POST["id"];
$email = $_POST["email"];
// generate a unique token for this email only
$token = time() . md5($email);
// get file from database
$sql = "SELECT * FROM files WHERE id = :id";
$result = $conn->prepare($sql);
$result->execute([
":id" => $id
]);
$file = $result->fetch();
if ($file == null)
{
die("File not found");
}
// insert in download requests, prevent SQL injection too
$sql = "INSERT INTO download_requests(file_id, email, token) VALUES (:id, :email, :token)";
$result = $conn->prepare($sql);
$result->execute([
":id" => $id,
":email" => $email,
":token" => $token
]);
// send email to user
$mail = new PHPMailer(true);
try
{
$mail->SMTPDebug = 0;
$mail->isSMTP();
$mail->Host = 'smtp.gmail.com';
$mail->SMTPAuth = true;
$mail->Username = 'your_email@gmail.com';
$mail->Password = 'your_password';
$mail->SMTPSecure = PHPMailer::ENCRYPTION_STARTTLS;
$mail->Port = 587;
$mail->setFrom('adnan@gmail.com', 'Adnan');
$mail->addAddress($email); // Add a recipient
$mail->addReplyTo('adnan@gmail.com', 'Adnan');
// Content
$mail->isHTML(true);
$mail->Subject = 'Download your files';
// mention download link in the email
$email_content = "Kindly click the link below to download your files: <br />";
$base_url = "http://localhost:8888/tutorials/collect-emails-while-downloading-files-php-mysql";
$email_content .= "<a href='" . $base_url . "/download.php?email=" . $email . "&token=" . $token . "'>" . $file['file_name'] . "</a>";
$mail->Body = $email_content;
$mail->send();
echo '<p>Link to download files has been sent to your email address: ' . $email . '</p>';
}
catch (Exception $e)
{
die("Message could not be sent. Mailer Error: " . $mail->ErrorInfo);
}
Make sure to change the $base_url value. Also, set the SMTP Username and Password to your own email address and password; this account will be used to send the emails. Go to this link and enable the “less secure apps” option for that email address.
Test the code now. You will see a list of all uploaded files, each with a download button. When clicked, a prompt appears where you can enter your email address. When you click “OK”, an email with a download link is sent, and the request is stored in the “download_requests” table in the MySQL database.
In your email, you will see a link to download the file. Right now, though, it will give a 404 Not Found error, because download.php has not been created yet.
Download the file from email download link
Create a file named “download.php” and write the following code in it. This will directly download the file into your system.
<?php
// connect with database
$conn = new PDO("mysql:host=localhost:8889;dbname=collect_emails_while_downloading_files", "root", "root");
// get variables from email
$email = $_GET["email"];
$token = $_GET["token"];
// check if the download request is valid
$sql = "SELECT *, download_requests.id AS download_request_id FROM download_requests INNER JOIN files ON files.id = download_requests.file_id WHERE download_requests.email = :email AND download_requests.token = :token";
$result = $conn->prepare($sql);
$result->execute([
":email" => $email,
":token" => $token
]);
$file = $result->fetch();
if ($file == null)
{
die("File not found.");
}
// download the file
$url_encoded_file_name = rawurlencode($file["file_name"]);
$file_url = "http://localhost:8888/tutorials/collect-emails-while-downloading-files-php-mysql/uploads/" . $url_encoded_file_name;
// die($file_url);
// headers to download any type of file
header('Content-Description: File Transfer');
header('Content-Type: application/octet-stream');
header('Content-Disposition: attachment; filename="' . $file["file_name"] . '"');
header('Expires: 0');
header('Cache-Control: must-revalidate');
header('Pragma: public');
header('Content-Length: ' . filesize($file["file_path"]));
readfile($file["file_path"]);
Make sure to change the base URL in the $file_url variable. Now you can run a complete test cycle again: upload a file, click the download button, and enter your email. Check your email, click the download link, and the file will be downloaded. Verify that the file downloaded correctly.
So that’s how you can collect a large number of emails by allowing people to simply download files. You can create a very large email list from it.
Learn how to send an attachment with an email using PHP.
In this article, we are going to create a web crawler using Node JS and Mongo DB. It will take a URL as an input and fetch all the anchor tags, headings, and paragraphs. You can add more features to it if you want.
Requirements
Make sure you have the following things installed in your system:
Node JS
Mongo DB
Code Editor (Sublime Text etc.)
Setup the Project
First, create an empty folder anywhere in your system. Create a file named server.js in that folder. Open CMD in that folder by running the following command:
cd "path_of_your_folder"
We are going to need multiple modules for this web crawler, so install them from the command line (a likely command is shown below).
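The exact install command was omitted here; based on the module list that follows, it would presumably be:

npm install express ejs socket.io request cheerio express-formidable mongodb htmlspecialchars node-html-parser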
Here is why each of these modules is needed:
express framework is used for routing.
http is Node's built-in module, used to create the HTTP server.
ejs is a template engine used for rendering HTML files.
socket.io is used for realtime communication.
request is used to fetch the content of a web page.
cheerio is used for jQuery-style DOM manipulation.
express-formidable to get values from FormData object.
mongodb will be our database.
htmlspecialchars is used to convert HTML tags into entities.
node-html-parser to convert the HTML string into DOM nodes.
After all the modules are installed, run the following commands to install nodemon globally and start the server:
npm install -g nodemon
nodemon server.js
Start the server
Open your server.js and write the following code in it to start the server at port 3000.
var express = require("express");
var app = express();
var http = require("http").createServer(app);
http.listen(3000, function () {
console.log("Server started running at port: 3000");
});
Now your project will be up and running at http://localhost:3000/
Connect Node JS with Mongo DB
To connect Node JS with Mongo DB, we first need to create an instance of the Mongo DB client in our server.js. Place the following lines before the http.listen function:
var mongodb = require("mongodb");
var mongoClient = mongodb.MongoClient;
var ObjectID = mongodb.ObjectID;
var database = null;
Now write the following code inside the http.listen callback function, so the database connects when the server starts.
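The connection code itself was not included here; a minimal sketch for the classic mongodb driver, assuming a local Mongo instance on the default port and the web_crawler database mentioned later:

mongoClient.connect("mongodb://localhost:27017", function (error, client) {
    if (error) {
        console.error(error);
        return;
    }
    // select (and implicitly create) the database
    database = client.db("web_crawler");
    console.log("Database connected");
});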
If you check your CMD now, you will see the message “Database connected”.
Crawl the web page
Now we need to create a form to get the URL as input. First we will tell our Express app that we will be using EJS as our templating engine, and that all our CSS and JS files will be inside the public folder. Place the following lines before the http.listen function.
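These lines were not shown here; the standard Express setup for this is:

app.set("view engine", "ejs");
app.use(express.static("public"));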
Now create 2 folders at the root of your project, “public” and “views”. Download the latest jQuery, Bootstrap, DataTable, and Socket IO libraries and place their files inside the public folder. Create a new file named index.ejs inside the views folder. Then create a GET route in server.js, once Mongo DB is connected:
app.get("/", async function (request, result) {
result.render("index");
});
If you access your project from the browser now, you will see an empty screen. Open your index.ejs and write a simple form in it (a sketch is shown below).
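The index.ejs markup was not included here; a minimal sketch, assuming an input named url (read later via request.fields.url) and the crawlPage submit handler below:

<form onsubmit="return crawlPage(this);">
    <input type="text" name="url" placeholder="Enter URL to crawl" required />
    <input type="submit" value="Crawl" />
</form>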
You will now see a simple form with an input field and a submit button. In the input field, you can enter the URL of the page you want to crawl. Now we need to create a Javascript function that will be called when the form is submitted. In that function, we will send an AJAX request to the Node JS server.
<script>
function crawlPage(form) {
var ajax = new XMLHttpRequest();
ajax.open("POST", "/crawl-page", true);
ajax.onreadystatechange = function () {
if (this.readyState == 4) {
if (this.status == 200) {
// console.log(this.responseText);
var data = JSON.parse(this.responseText);
// console.log(data);
}
}
};
var formData = new FormData(form);
ajax.send(formData);
return false;
}
</script>
Get the web page content
To fetch the content of the web page, we will first use express-formidable as our middleware. We also need to require the modules used to read the web page and convert its HTML into DOM nodes. Write the following lines before the http.listen function.
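These lines were not shown here; based on the identifiers used later (requestModule, cheerio, HTMLParser, htmlspecialchars, and io), they would presumably be:

var formidable = require("express-formidable");
app.use(formidable());
var requestModule = require("request");
var cheerio = require("cheerio");
var HTMLParser = require("node-html-parser");
var htmlspecialchars = require("htmlspecialchars");
// socket.io server used to emit crawl updates to the browser
var io = require("socket.io")(http, {
    cors: {
        origin: "*"
    }
});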
After that, we will create a POST route to crawl the web page.
app.post("/crawl-page", async function (request, result) {
var url = request.fields.url;
crawlPage(url);
result.json({
"status": "success",
"message": "Page has been crawled",
"url": url
});
});
The actual crawling happens in a separate function. Next, we will create the functions that crawl the web page and save its content in Mongo DB. Write the following functions at the top of your server.js file:
function getTagContent(querySelector, content, pageUrl) {
    var tags = content.querySelectorAll(querySelector);
    var innerHTMLs = [];
    for (var a = 0; a < tags.length; a++) {
        // use a separate variable so we do not overwrite the "content" parameter
        var text = "";
        var anchorTag = tags[a].querySelector("a");
        if (anchorTag != null) {
            text = anchorTag.innerHTML;
        } else {
            text = tags[a].innerHTML;
        }
        // collapse whitespace and skip empty tags
        text = text.replace(/\s+/g, ' ').trim();
        if (text.length > 0) {
            innerHTMLs.push(text);
        }
    }
    return innerHTMLs;
}
function crawlPage(url, callBack = null) {
var pathArray = url.split( '/' );
var protocol = pathArray[0];
var host = pathArray[2];
var baseUrl = protocol + '//' + host;
io.emit("crawl_update", "Crawling page: " + url);
requestModule(url, async function (error, response, html) {
if (!error && response.statusCode == 200) {
var $ = cheerio.load(html);
// Get text
// console.log("------- with request module -------")
// console.log($.text());
// Get HTML
// console.log($.html());
var page = await database.collection("pages").findOne({
"url": url
});
if (page == null) {
var html = $.html();
var htmlContent = HTMLParser.parse(html);
var allAnchors = htmlContent.querySelectorAll("a");
var anchors = [];
for (var a = 0; a < allAnchors.length; a++) {
var href = allAnchors[a].getAttribute("href");
var title = allAnchors[a].innerHTML;
var hasAnyChildTag = (allAnchors[a].querySelector("div") != null)
|| (allAnchors[a].querySelector("img") != null)
|| (allAnchors[a].querySelector("p") != null)
|| (allAnchors[a].querySelector("span") != null)
|| (allAnchors[a].querySelector("svg") != null)
|| (allAnchors[a].querySelector("strong") != null);
if (hasAnyChildTag) {
continue;
}
if (href != null) {
if (href == "#" || href.search("javascript:void(0)") != -1) {
continue;
}
var first4Words = href.substr(0, 4);
if (href.search(url) == -1 && first4Words != "http") {
if (href[0] == "/") {
href = baseUrl + href;
} else {
href = baseUrl + "/" + href;
}
}
anchors.push({
"href": href,
"text": title
});
}
}
io.emit("crawl_update", htmlspecialchars("<a>") + " tags has been crawled");
var titles = await getTagContent("title", htmlContent, url);
var title = titles.length > 0 ? titles[0] : "";
io.emit("crawl_update", htmlspecialchars("<title>") + " tag has been crawled");
var h1s = await getTagContent("h1", htmlContent, url);
io.emit("crawl_update", htmlspecialchars("<h1>") + " tags has been crawled");
var h2s = await getTagContent("h2", htmlContent, url);
io.emit("crawl_update", htmlspecialchars("<h2>") + " tags has been crawled");
var h3s = await getTagContent("h3", htmlContent, url);
io.emit("crawl_update", htmlspecialchars("<h3>") + " tags has been crawled");
var h4s = await getTagContent("h4", htmlContent, url);
io.emit("crawl_update", htmlspecialchars("<h4>") + " tags has been crawled");
var h5s = await getTagContent("h5", htmlContent, url);
io.emit("crawl_update", htmlspecialchars("<h5>") + " tags has been crawled");
var h6s = await getTagContent("h6", htmlContent, url);
io.emit("crawl_update", htmlspecialchars("<h6>") + " tags has been crawled");
var ps = await getTagContent("p", htmlContent, url);
io.emit("crawl_update", htmlspecialchars("<p>") + " tags has been crawled");
var object = {
"url": url,
"anchors": anchors,
"title": title,
"h1s": h1s,
"h2s": h2s,
"h3s": h3s,
"h4s": h4s,
"h5s": h5s,
"h6s": h6s,
"ps": ps,
"time": new Date().getTime()
};
try {
await database.collection("pages").insertOne(object);
} catch (e) {
console.log(e);
}
io.emit("page_crawled", object);
io.emit("crawl_update", "Page crawled.");
} else {
io.emit("crawl_update", "Page already crawled.");
}
if (callBack != null) {
callBack();
}
}
});
}
If you refresh the page now, enter the URL of any web page, and hit enter, you will see its content stored in a Mongo DB database named web_crawler. To inspect the data, you can download a tool named MongoDB Compass.
Show data in DataTable
Now whenever a new web page is crawled, we will display it in a table using a library called DataTable. We will also include the Socket IO library for real-time communication. Include those files in your index.ejs (example include tags are shown below).
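The include tags were not shown here; assuming the library files were copied into the public folder, they would look like:

<link rel="stylesheet" href="/bootstrap.min.css" />
<link rel="stylesheet" href="/jquery.dataTables.min.css" />
<script src="/jquery.min.js"></script>
<script src="/jquery.dataTables.min.js"></script>
<script src="/socket.io.js"></script>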
Then we will create a row with 2 columns: in the left column, a table displaying all crawled pages, and in the right column, a list of crawl updates, e.g. “headings have been crawled”, “paragraphs have been crawled”, etc. A sketch of this markup follows.
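The markup was not included here; a minimal Bootstrap sketch, using the my-table and my-updates IDs referenced by the script below:

<div class="row">
    <div class="col-md-8">
        <table id="my-table" class="table">
            <thead>
                <tr>
                    <th>URL</th>
                    <th>Title</th>
                    <th>Time</th>
                </tr>
            </thead>
            <tbody></tbody>
        </table>
    </div>
    <div class="col-md-4">
        <ul id="my-updates" class="list-group"></ul>
    </div>
</div>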
Then we need to initialize the DataTable library and attach event listeners for crawl updates. Crawl updates will be prepended to the <ul> list, and each fully crawled web page will be appended to the data table.
var table = null;
var socketIO = io("http://localhost:3000/");
var months = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"];
window.addEventListener("load", function () {
table = $('#my-table').DataTable({
"order": [[ 2, "asc" ]]
});
});
socketIO.on("crawl_update", function (data) {
// console.log(data);
var html = "";
html += `<li class="list-group-item">` + data + `</li>`;
document.getElementById("my-updates").innerHTML = html + document.getElementById("my-updates").innerHTML;
document.getElementById('my-updates').scrollTop = 0;
});
socketIO.on("page_crawled", function (data) {
// console.log(data);
var date = new Date(data.time);
// getMonth() is zero-based, so it indexes the months array directly
var time = date.getDate() + " " + months[date.getMonth()] + ", " + date.getFullYear() + " - " + date.getHours() + ":" + date.getMinutes() + ":" + date.getSeconds();
table.row.add( [
"<a href='/page/" + encodeURIComponent(data.url) + "'>" + data.url + "</a>",
data.title,
time
] ).draw( false );
});
Now you will see the data in the table when you crawl some page. You can crawl as many pages as you want.
Fetch data from Mongo DB
At this point, data in the data table is only displayed when you crawl some page. But when you reload the page, the data table will be empty. However, the data is still stored in the database. Our web crawler has saved all the crawled pages in a Mongo DB collection named “pages”. So we need to populate the previously saved pages from the database in the data table when the page loads.
First, change our “/” GET route in the server.js to the following:
app.get("/", async function (request, result) {
var months = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"];
var pages = await database.collection("pages").find({})
.sort({
"time": -1
}).toArray();
for (var index in pages) {
var date = new Date(pages[index].time);
// getMonth() is zero-based, so it indexes the months array directly
var time = date.getDate() + " " + months[date.getMonth()] + ", " + date.getFullYear() + " - " + date.getHours() + ":" + date.getMinutes() + ":" + date.getSeconds();
pages[index].time = time;
}
result.render("index", {
"pages": pages
});
});
And in our index.ejs, inside the <tbody> tag, we will display all the pages (a sketch is shown below).
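The EJS loop was not shown here; a minimal sketch, matching the pages array passed from the “/” route and the link format used by the table above:

<% for (var index in pages) { %>
    <tr>
        <td><a href="/page/<%= encodeURIComponent(pages[index].url) %>"><%= pages[index].url %></a></td>
        <td><%= pages[index].title %></td>
        <td><%= pages[index].time %></td>
    </tr>
<% } %>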
If you refresh the page now, you will see all pages in the data table. You will only see the URL, title, and the time when the page was last crawled. But we also need to see the anchor tags, headings, and paragraphs of each page.
Show page content
Click on any of the links in the data table and it will take you to an error page. We need to convert that error page into a detail page. Create a GET route in server.js that fetches the page from the database and sends it to an HTML file:
app.get("/page/:url", async function (request, result) {
var url = request.params.url;
var page = await database.collection("pages").findOne({
"url": url
});
if (page == null) {
result.render("404", {
"message": "This page has not been crawled"
});
return false;
}
result.render("page", {
"page": page
});
});
In your views folder, create a file named 404.ejs that will be displayed when the URL has not been crawled yet (a sketch is shown below).
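The 404.ejs content was not included here; a minimal sketch, using the message variable passed from the route above:

<h1>Page not found</h1>
<p><%= message %></p>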
Along with all the data of the web page, the detail page will also show 2 buttons: “delete” and “re-index”. Delete simply means deleting the page from the database; “re-index” means re-crawling the web page to fetch updated content. First, we will create a POST route for deleting the page in our server.js file (the delete form itself is sketched after this route):
app.post("/delete-page", async function (request, result) {
var url = request.fields.url;
await database.collection("pages").deleteOne({
"url": url
});
io.emit("page_deleted", url);
var backURL = request.header('Referer') || '/';
result.redirect(backURL);
});
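The delete form itself was not shown here; mirroring the re-index form presented later, it would look like:

<div class="col-md-1">
    <form method="POST" action="/delete-page" onsubmit="return confirm('Are you sure you want to delete this page ?');">
        <input type="hidden" name="url" value="<%= page.url %>" required />
        <input type="submit" class="btn btn-danger" value="Delete" />
    </form>
</div>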
And in our index.ejs, we will attach an event listener that will be called when a page is deleted. In that listener, we simply remove the corresponding row from the data table.
Remove specific row from DataTable.js
socketIO.on("page_deleted", function (url) {
table
.rows( function ( idx, data, node ) {
return data[0].includes(url);
} )
.remove()
.draw();
});
This selects the rows whose first column contains the URL and removes them; the draw() call then re-renders the data table.
Re-index the page
Now we need to add a function to re-index the page, which means to get the updated content of the page. As we did for delete, we will also create a form for re-indexing.
<div class="col-md-1">
<form method="POST" action="/reindex" onsubmit="return confirm('Are you sure you want to re-index this page ?');">
<input type="hidden" name="url" value="<%= page.url %>" required />
<input type="submit" class="btn btn-primary" value="Re-index" />
</form>
</div>
This will show a “Re-index” button along with a delete button. Then we need to create a POST route in our server.js:
app.post("/reindex", async function (request, result) {
var url = request.fields.url;
await database.collection("pages").deleteOne({
"url": url
});
io.emit("page_deleted", url);
crawlPage(url, function () {
var backURL = request.header('Referer') || '/';
result.redirect(backURL);
});
});
That’s it. If you run the code now, you will be able to re-index a page and update your database. You can try this feature by crawling a URL today, then crawling the same URL again a few days or weeks later.
So that’s how you can create a simple web crawler in Node JS and Mongo DB. You can check our financial ledger tutorial to learn how to create a financial ledger in Node JS and Mongo DB.