Website Classification from Webpage Renders

Leonardo Espinosa-Leal*, Anton Akusok, Amaury Lendasse, Kaj-Mikael Björk

*Corresponding author for this work

Research output: Chapter in book/report/conference proceeding › Chapter › Scientific › Peer-reviewed

Abstract

In this paper, we present a fast and accurate method for the classification of web content. Our algorithm uses the visual information of the main homepage, saved in an image format by means of a full-body snapshot. Sliding windows of different sizes and overlaps are used to obtain a large set of sub-images for each render. For each sub-image, a feature vector is extracted by means of a pre-trained deep learning model. An Extreme Learning Machine (ELM) model is trained for different numbers of hidden neurons using the large collection of features from a curated dataset of 5979 webpages with different classes: adult, alcohol, dating, gambling, shopping, tobacco and weapons. Our results show that the ELM classifier can be trained without manual object tagging of the sub-images while giving excellent results in comparison to more complex deep learning models. A random forest classifier was trained for the specific class of weapons, providing an accuracy of 95% with an F1 score of 0.8.
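
The pipeline described in the abstract (sliding windows, per-window features, ELM classifier) can be illustrated with a minimal sketch. It is not the authors' implementation. Several pieces are assumptions made only for illustration: `placeholder_features` stands in for the pre-trained deep learning model, each sub-image simply inherits the label of its page, the page-level decision is a majority vote over windows, and the ELM uses a tanh hidden layer with ridge-regularised output weights. All function and class names are hypothetical, and the "renders" are synthetic arrays.

```python
import numpy as np


def sliding_windows(image, size, stride):
    """Yield square crops of side `size`, moving `stride` pixels per step."""
    height, width = image.shape[:2]
    for top in range(0, height - size + 1, stride):
        for left in range(0, width - size + 1, stride):
            yield image[top:top + size, left:left + size]


def placeholder_features(crop):
    """Stand-in for a pre-trained deep model (assumption):
    per-channel mean and standard deviation, scaled to [0, 1]."""
    pixels = crop.reshape(-1, crop.shape[-1]).astype(np.float64) / 255.0
    return np.concatenate([pixels.mean(axis=0), pixels.std(axis=0)])


class SimpleELM:
    """Basic ELM: fixed random hidden layer, output weights from ridge regression."""

    def __init__(self, n_hidden, ridge=1e-3, seed=0):
        self.n_hidden = n_hidden
        self.ridge = ridge
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, y, n_classes):
        # Hidden-layer weights are drawn once and never trained.
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        T = np.eye(n_classes)[y]  # one-hot targets
        A = H.T @ H + self.ridge * np.eye(self.n_hidden)
        self.beta = np.linalg.solve(A, H.T @ T)
        return self

    def predict(self, X):
        return np.argmax(self._hidden(X) @ self.beta, axis=1)


def classify_page(page, model, size=32, stride=16):
    """Page-level label by majority vote over window predictions (assumption)."""
    feats = np.array([placeholder_features(c)
                      for c in sliding_windows(page, size, stride)])
    votes = model.predict(feats)
    return int(np.bincount(votes).argmax())


# Two synthetic "renders": a dark page (class 0) and a bright page (class 1).
rng = np.random.default_rng(42)
dark_page = rng.integers(0, 100, size=(96, 64, 3))
bright_page = rng.integers(150, 256, size=(96, 64, 3))

X, y = [], []
for label, page in enumerate([dark_page, bright_page]):
    for crop in sliding_windows(page, size=32, stride=16):
        X.append(placeholder_features(crop))
        y.append(label)  # every window inherits its page label
X, y = np.array(X), np.array(y)

model = SimpleELM(n_hidden=20).fit(X, y, n_classes=2)
print("training accuracy:", (model.predict(X) == y).mean())
print("bright page classified as:", classify_page(bright_page, model))
```

Because the hidden layer is fixed, training reduces to a single linear solve, which is what makes it cheap to repeat for different numbers of hidden neurons. In a realistic setting the placeholder feature function would be replaced by activations from a pre-trained network, and windows of several sizes and overlaps would be combined.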
Original language: English
Title of host publication: Proceedings of ELM2019
Editors: Jiuwen Cao, Chi Man Vong, Yoan Miche, Amaury Lendasse
Number of pages: 10
Publisher: Springer
Publication date: 2021
Pages: 41-50
ISBN (print): 978-3-030-58988-2
ISBN (electronic): 978-3-030-58989-9
Status: Published - 2021
MoE publication type: A3 Part of a book or other collected work

Publication series

Name: Proceedings of ELM2019
Volume: 14
ISSN (print): 2363-6084
ISSN (electronic): 2363-6092
