Website Classification from Webpage Renders

Leonardo Espinosa-Leal*, Anton Akusok, Amaury Lendasse, Kaj-Mikael Björk

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Chapter › Scientific › peer-review

Abstract

In this paper, we present a fast and accurate method for classifying web content. Our algorithm uses the visual information of a site's main homepage, saved as an image by taking a full-page snapshot. Sliding windows of different sizes and overlaps are used to obtain a large set of sub-images for each render. For each sub-image, a feature vector is extracted with a pre-trained deep learning model. An Extreme Learning Machine (ELM) model is trained for different numbers of hidden neurons using the large collection of features from a curated dataset of 5979 webpages spanning seven classes: adult, alcohol, dating, gambling, shopping, tobacco and weapons. Our results show that the ELM classifier can be trained without manually tagging specific objects in the sub-images, while giving excellent results in comparison to more complex deep learning models. A random forest classifier trained for the specific class of weapons achieved an accuracy of 95% with an F1 score of 0.8.
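
The pipeline described in the abstract (sliding windows over a render, one feature vector per sub-image, an ELM trained on those vectors) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: `extract_features` is a hypothetical stand-in for the pre-trained deep model (here it only computes per-channel mean and standard deviation), the ridge-regularised least-squares solve is one common way to fit ELM output weights, and summing sub-image scores to label a whole render is an assumption, since the abstract does not state how sub-image predictions are combined.

```python
import numpy as np


def sliding_windows(image, size, stride):
    """Yield square crops of side `size`, taken every `stride` pixels."""
    height, width = image.shape[:2]
    for top in range(0, height - size + 1, stride):
        for left in range(0, width - size + 1, stride):
            yield image[top:top + size, left:left + size]


def extract_features(crop):
    """Hypothetical placeholder for a pre-trained deep model.

    Returns the per-channel mean and standard deviation of the crop.
    """
    pixels = crop.reshape(-1, crop.shape[-1]).astype(np.float64)
    return np.concatenate([pixels.mean(axis=0), pixels.std(axis=0)])


class ELM:
    """Single-hidden-layer ELM: random fixed hidden layer, linear readout."""

    def __init__(self, n_hidden, n_classes, ridge=1e-3, seed=0):
        self.n_hidden = n_hidden
        self.n_classes = n_classes
        self.ridge = ridge  # regularisation strength (assumed value)
        self.seed = seed

    def _hidden(self, X):
        # Hidden weights are random and never trained.
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, y):
        rng = np.random.default_rng(self.seed)
        self.W = rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        T = np.eye(self.n_classes)[y]  # one-hot targets
        # Output weights from a ridge-regularised least-squares problem.
        A = H.T @ H + self.ridge * np.eye(self.n_hidden)
        self.beta = np.linalg.solve(A, H.T @ T)
        return self

    def predict_scores(self, X):
        return self._hidden(X) @ self.beta


def classify_render(model, image, size, stride):
    """Label a render by summing ELM scores over its sub-images (assumed rule)."""
    crops = sliding_windows(image, size, stride)
    features = np.stack([extract_features(crop) for crop in crops])
    return int(model.predict_scores(features).sum(axis=0).argmax())
```

In this sketch, sweeping `n_hidden` corresponds to training the ELM "for different numbers of hidden neurons", and calling `sliding_windows` with several `size` and `stride` values gives windows of different sizes and overlaps.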
Original language: English
Title of host publication: Proceedings of ELM2019
Editors: Jiuwen Cao, Chi Man Vong, Yoan Miche, Amaury Lendasse
Number of pages: 10
Publisher: Springer
Publication date: 2021
Pages: 41-50
ISBN (Print): 978-3-030-58988-2
ISBN (Electronic): 978-3-030-58989-9
Publication status: Published - 2021
MoE publication type: A3 Book chapter

Publication series

Name: Proceedings of ELM2019
Volume: 14
ISSN (Print): 2363-6084
ISSN (Electronic): 2363-6092
