In this paper, we present a fast and accurate method for the classification of web content. Our algorithm uses the visual information of the main homepage, saved as an image by means of a full-body snapshot. Sliding windows of different sizes and overlaps are used to obtain a large set of sub-images for each render. For each sub-image, a feature vector is extracted by means of a pre-trained deep learning model. An Extreme Learning Machine (ELM) model is trained for different numbers of hidden neurons using the large collection of features from a curated dataset of 5979 webpages covering the classes adult, alcohol, dating, gambling, shopping, tobacco and weapons. Our results show that the ELM classifier can be trained without manually tagging specific objects in the sub-images while still giving excellent results in comparison to more complex deep learning models. A random forest classifier trained for the specific class of weapons achieved an accuracy of 95% with an F1 score of 0.8.
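The pipeline described above can be illustrated with a minimal sketch in Python with NumPy. It is not the authors' implementation: the window size, the overlap, the tanh activation, the pseudoinverse solution for the output weights, and the names `sliding_windows` and `ELM` are all assumptions made for illustration. The pre-trained deep feature extractor is left out, so the ELM here simply takes already-extracted feature vectors as input.

```python
import numpy as np


def sliding_windows(image, size, overlap):
    """Yield square crops of side `size`; `overlap` is the fraction shared by neighbours.

    Hypothetical helper: the abstract only says that windows of different
    sizes and overlaps are used, not how they are generated.
    """
    step = max(1, int(size * (1.0 - overlap)))
    height, width = image.shape[:2]
    for top in range(0, height - size + 1, step):
        for left in range(0, width - size + 1, step):
            yield image[top:top + size, left:left + size]


class ELM:
    """Basic single-hidden-layer ELM (assumed variant, not taken from the paper).

    Input weights and biases are drawn at random and never trained; only the
    output weights are fitted, in closed form, by least squares.
    """

    def __init__(self, n_hidden, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Random projection followed by a tanh non-linearity.
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, y, n_classes):
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        targets = np.eye(n_classes)[y]  # one-hot encoding of integer labels
        H = self._hidden(X)
        # Least-squares output weights via the Moore-Penrose pseudoinverse.
        self.beta = np.linalg.pinv(H) @ targets
        return self

    def predict(self, X):
        return np.argmax(self._hidden(X) @ self.beta, axis=1)
```

Under these assumptions, each crop from `sliding_windows` would be passed through the feature extractor, and the resulting vectors, labelled with the class of their source webpage, would be stacked into the matrix given to `ELM.fit`. Training for different numbers of hidden neurons amounts to repeating `fit` with different `n_hidden` values.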
|Title of host publication||Proceedings of ELM2019|
|Editors||Jiuwen Cao, Chi Man Vong, Yoan Miche, Amaury Lendasse|
|Status||Published - 2021|
|MoE publication type||A3 Part of a book or another research book|
|Name||Proceedings of ELM2019|