
Genre classification based on pure URL

$30-250 USD

Closed
Posted over 2 years ago

Paid on delivery
The ultimate goal is to classify URLs based on the URL string alone, without using any other features. There are two main tasks. The first is a URL classification project using PyTorch and RoBERTa, fine-tuned with descriptions from DMOZ, on the DMOZ dataset and 2 other datasets. The second uses sentence-transformers to predict a URL from its meta-description, implementing MultipleNegativesRankingLoss. Extra information will be shared via email. [login to view URL]

Description

1. Datasets
a. Mainly the DMOZ dataset. [login to view URL]
b. Malicious URLs dataset. [login to view URL]
c. URL dataset (ISCX-URL2016). [login to view URL]
d. Detecting Malicious URLs. [login to view URL]
e. ANT Datasets. [login to view URL]
The last four datasets (b to e) are for comparison in classification.

2. Experimental setting

Task 1. This is a simple genre classification of URLs.
a. Run the genre or phishing classification using the RoBERTa-base and RoBERTa-large models.
b. Use only the URLs first: split each URL by '/', then by punctuation, then with a Python word segmenter, and finally with Universal Word Segmentation ([login to view URL]). GitHub for Universal Word Segmentation: [login to view URL]
c. Fine-tune the RoBERTa models with the descriptions from DMOZ.
d. Run the models again.
e. Result tables and the implemented equations are required here.

Task 2. This is essentially predicting URLs from descriptions.
a. Using sentence_transformers, embed the DMOZ descriptions. For pre-trained models, use "all-mpnet-base-v2" and "all-MiniLM-L6-v2". For starting from scratch, which means building the model, use RoBERTa-base.
b. After embedding the descriptions with their matching URLs, run sentence-transformers (see the usage examples at the following link: [login to view URL]).
c. For the loss function, try BatchAllTripletLoss, BatchHardSoftMarginTripletLoss, MultipleNegativesRankingLoss, and TripletLoss.
d. A comparison table of each model and loss function is needed.

Task 3. This combines Tasks 1 and 2. Given a random description, predict the URL and classify the URL's genre.

* Equations for the loss functions and a step-by-step explanation of the models are needed. For example, we can implement a fully connected dense layer with an activation after the pooling layer for sentence-transformers.

The work must be at master's level, include an abstract, use APA formatting, and have at least 20 references with proper in-text citations. It must be written in US English.
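To make step b of Task 1 concrete, here is a minimal sketch of the staged URL splitting, using only the Python standard library. The function name `tokenize_url` and the `segment` parameter are hypothetical and not part of the project brief. `segment` stands in for whichever word segmenter is finally chosen (any callable that maps a string to a list of words). The Universal Word Segmentation stage is not shown.

```python
import re
from urllib.parse import urlsplit


def tokenize_url(url, segment=None):
    """Split a URL into tokens: by '/', then by punctuation, then by an
    optional word segmenter (hypothetical hook, assumed str -> list[str])."""
    # Prefix scheme-less URLs with '//' so the host is parsed as netloc.
    parts = urlsplit(url if "://" in url else "//" + url)

    # Stage 1: split host and path on '/'. The scheme is dropped.
    chunks = [parts.netloc] + parts.path.split("/")
    if parts.query:
        chunks.append(parts.query)

    tokens = []
    for chunk in chunks:
        # Stage 2: split on every run of non-alphanumeric characters.
        for piece in re.split(r"[^A-Za-z0-9]+", chunk):
            if not piece:
                continue
            # Stage 3: optionally break concatenated words apart.
            tokens.extend(segment(piece) if segment else [piece])
    return tokens


if __name__ == "__main__":
    print(tokenize_url("http://www.example.com/arts/music-reviews/index.html"))
```

The resulting tokens can be joined with spaces and passed to the RoBERTa tokenizer as ordinary text. Whether to keep tokens such as "www" or "html" is a design choice the brief leaves open.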
Project ID: 32963093

About the project

1 proposal
Remote project
Active 2 yrs ago

1 freelancer is bidding on average $250 USD for this job
Hi, it's an easy task for us. We have experienced developers in PHP, C programming, and web scraping. We have been operating since 2012. Please come on chat to discuss the project in detail. Project milestones will be decided during chat. Thank you. Regards: Arpit Jain, Black Grapes Softech
$250 USD in 7 days
0.0 (0 reviews)

About the client

islamabad, Pakistan
5.0
2
Payment method verified
Member since Sep 29, 2021
