The ultimate goal is to classify URLs based on the URLs alone, without using other features.
There are two tasks:
One is a URL classification project using PyTorch and RoBERTa, fine-tuned with descriptions from DMOZ, on the DMOZ dataset and 2 other datasets.
The other task uses sentence-transformers to predict a URL from a meta-description, implementing MultipleNegativesRankingLoss.
Extra information will be shared via email.
[login to view URL]
Description.
1. Datasets
a. Mainly using the DMOZ dataset.
[login to view URL]
b. Malicious URLs dataset.
[login to view URL]
c. URL dataset (ISCX-URL2016)
[login to view URL]
d. Detecting Malicious URLs
[login to view URL]
e. ANT Datasets
[login to view URL]
The last four datasets (b-e) are for comparison in the classification task.
2. Experimental setting
Task 1.
This is a simple genre classification of URLs.
a. Using the RoBERTa-base and RoBERTa-large models, run the genre or phishing
classification.
b. Use only the URLs at first: split each URL by '/', then by punctuation, then
with a Python word segmenter, and finally with Universal Word Segmentation
([login to view URL]).
Github for Universal Word Segmentation:
[login to view URL]
c. Fine-tune RoBERTa models with the descriptions from DMOZ.
d. Run the models again.
e. Result tables and implemented equations are required here.
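The first two splitting steps in (b) can be sketched as below. This is only a minimal illustration using the Python standard library; the function name `tokenize_url` and the choice to lowercase tokens and to include the query string are assumptions. The word-segmenter and Universal Word Segmentation steps are left out, since they depend on the tools chosen.

```python
import re
from urllib.parse import urlsplit


def tokenize_url(url: str) -> list[str]:
    """Split a URL into lowercase tokens: first by '/', then by punctuation."""
    # urlsplit only finds the host when the URL has a scheme or starts with "//".
    parts = urlsplit(url if "://" in url else "//" + url)

    # Step 1: split into host and '/'-separated path segments.
    segments = [parts.netloc] + parts.path.split("/")
    if parts.query:
        segments.append(parts.query)

    tokens = []
    for segment in segments:
        # Step 2: split each segment on runs of non-alphanumeric characters.
        tokens.extend(t.lower() for t in re.split(r"[^A-Za-z0-9]+", segment) if t)
    return tokens


if __name__ == "__main__":
    print(tokenize_url("http://www.example.com/sports/foot-ball/index.html"))
```

The resulting tokens could then be joined with spaces and passed to the RoBERTa tokenizer, with a word segmenter applied first to split concatenated words.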
Task 2.
This is basically predicting URLs from descriptions.
a. Using sentence_transformers, embed the DMOZ descriptions with a model. For
pre-trained models, use "all-mpnet-base-v2" and "all-MiniLM-L6-v2". For starting
from scratch, which means building the model, use RoBERTa-base.
b. After embedding the descriptions with their matching URLs, train with
sentence-transformers (see the usage examples at the following link:
[login to view URL]).
c. For the loss function, try BatchAllTripletLoss,
BatchHardSoftMarginTripletLoss, MultipleNegativesRankingLoss, and TripletLoss.
d. A comparison table of each model and loss function is needed.
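To make the in-batch-negatives idea behind MultipleNegativesRankingLoss concrete, here is a small sketch in plain Python. It is not the library's implementation: the function names are hypothetical, the embeddings are passed in as plain lists, and `scale=20.0` with cosine similarity is an assumption about a typical configuration. For description i, URL i is treated as the positive and every other URL in the batch as a negative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def in_batch_ranking_loss(desc_vecs, url_vecs, scale=20.0):
    """Mean cross-entropy over a batch of (description, URL) embedding pairs.

    Row i of the similarity matrix is scored against all URLs in the batch,
    and the matching URL (column i) is the correct class.
    """
    n = len(desc_vecs)
    total = 0.0
    for i in range(n):
        logits = [scale * cosine(desc_vecs[i], url_vecs[j]) for j in range(n)]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / n


if __name__ == "__main__":
    aligned = in_batch_ranking_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
    swapped = in_batch_ranking_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
    print(aligned, swapped)
```

The loss is near zero when each description is closest to its own URL and grows when descriptions match the wrong URLs, which is the behavior the comparison table in (d) would measure across models.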
Task 3.
This combines tasks 1 and 2.
Given a random description, predict the URL and classify the URL's genre.
* Equations for the loss functions and a step-by-step explanation of the
models are needed. For example, we can implement a fully connected dense layer
with some activation after the pooling layer for sentence-transformers.
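As an illustration of the kind of equations expected, the pooling, dense layer, and one of the listed losses could be written as follows. The notation is an assumption made for this sketch: $\mathbf{h}_t$ are the $T$ token embeddings, $W$ and $\mathbf{b}$ are the dense-layer parameters, $\tanh$ stands in for "some activation", and the triplet loss uses Euclidean distance with margin $m$ on an anchor, positive, and negative $(a, p, n)$.

```latex
\begin{align}
\mathbf{p} &= \frac{1}{T}\sum_{t=1}^{T}\mathbf{h}_t
  && \text{(mean pooling over token embeddings)}\\
\mathbf{z} &= \tanh\!\left(W\mathbf{p} + \mathbf{b}\right)
  && \text{(dense layer with activation)}\\
\mathcal{L}_{\text{triplet}} &= \max\!\left(\lVert \mathbf{z}_a - \mathbf{z}_p\rVert_2
  - \lVert \mathbf{z}_a - \mathbf{z}_n\rVert_2 + m,\; 0\right)
  && \text{(triplet loss on the final embeddings)}
\end{align}
```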
The report has to be at master's level, with an abstract, APA formatting, and at least 20 references with proper in-text citations. It must also be written in US English.
Hi,
It's an easy task for us.
We have experienced developers in PHP, C programming, and web scraping.
We have been operating since 2012.
Please come on chat to discuss the project in detail.
Project Milestones will be decided during chat.
Thank You
Regards:
Arpit Jain
Black Grapes Softech