I want someone to do a theme analysis of around 5 million comments from a video-sharing website, using the PySpark ML library as the main tool. I will provide the dataset. The work environment should be Databricks Community Edition (you can create an account for free), and the deliverable is a Databricks notebook.
The data is at “video_creator – commentor_id – comment” granularity. What I want you to do is the following:
1. Remove comments that are not written in English.
2. For each commentor_id, concatenate all of his/her comments into a single feature called "all_comments". That is, aggregate the dataset to commentor_id – all_comments granularity.
3. Transform the "all_comments" feature using the Word2Vec module of the PySpark ML library (not the RDD-based MLlib library, as I want to do everything with DataFrames).
4. Cluster the transformed "all_comments" feature using the LDA module of PySpark ML.
5. Generate the most frequent words for each cluster identified in step 4. I will do the interpretation of the results myself, so you don't need to worry about that.
So overall, it's a straightforward task of data cleaning, aggregation, and application of standard PySpark ML modules.
I estimate this project will take 2 to 3 hours of programming for someone proficient in Python and PySpark. I hope to have the project done in 3 days; up to 6 days is acceptable. If you place a bid, I will share the link to the data file. I have no instructions other than the five steps listed above.
11 freelancers are bidding on average $232 for this job
I hope to see you in chat. Though I am new to freelancer.com, I am an experienced Python developer with full-stack knowledge and experience. I'm sure I can do this perfectly. Thanks for your kind attention.
Hello! I am a Python developer. I looked at your project and it seems interesting. I have all the skills required for this project. Ping me to discuss the details.