Text Analysis Project using PySpark ML

I want someone to do a theme analysis of around 5 million comments from a video sharing website, using the PySpark ML library as the main tool. I will provide the dataset. The work environment should be Databricks Community Edition (you can create an account for free), and the deliverable is a Databricks notebook.

The data is at “video_creator – commentor_id – comment” granularity. What I want you to do is the following:

1. Remove comments that are not written in English.

2. For each commentor_id, concatenate all of his/her comments into one feature called “all_comments”. That is, aggregate the dataset to “commentor_id – all_comments” granularity.

3. Transform the “all_comments” feature using the Word2Vec module of the PySpark ML library (not the MLlib library, as I want to do everything using DataFrames).

4. Do a clustering of the transformed “all_comments” feature using the LDA module of PySpark ML.

5. Generate the most frequent words for each cluster identified in step 4. I will interpret the results myself, so you don’t need to worry about that.
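Steps 1 and 2 above could be sketched roughly as follows. This is only an illustration under stated assumptions: the column names `comment` and `commentor_id` are taken from the granularity described above, and `looks_english` is a hypothetical, deliberately crude stopword heuristic standing in for a real language detector (PySpark ML does not ship one; a third-party detector would likely be more accurate).

```python
import re

# Tiny illustrative word list (an assumption, not a vetted resource).
ENGLISH_HINTS = frozenset({
    "the", "and", "is", "to", "of", "a", "in", "it",
    "this", "that", "you", "i", "for", "was", "so",
})


def looks_english(text, min_ratio=0.2):
    """Crude heuristic: share of tokens that are common English words."""
    if not text:
        return False
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        # No Latin-alphabet tokens at all, e.g. Cyrillic or CJK text.
        return False
    hits = sum(1 for token in tokens if token in ENGLISH_HINTS)
    return hits / len(tokens) >= min_ratio


def to_commentor_level(df):
    """Steps 1-2: keep English-looking comments, then one row per commentor_id."""
    # Imported here so the heuristic above can be tried without a Spark session.
    from pyspark.sql import functions as F
    from pyspark.sql.types import BooleanType

    is_english = F.udf(looks_english, BooleanType())
    return (
        df.filter(is_english(F.col("comment")))
        .groupBy("commentor_id")
        .agg(F.concat_ws(" ", F.collect_list("comment")).alias("all_comments"))
    )
```

In a Databricks notebook, `to_commentor_level(df)` would be called on the DataFrame loaded from the provided data file. The 0.2 threshold is arbitrary and would need tuning against a sample of the real comments.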

So overall, it’s a straightforward task of data cleaning, aggregation, and application of standard PySpark ML modules.
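The standard modules for steps 3 to 5 could be wired together roughly as follows. This is a sketch, not the definitive approach: the number of topics `k`, the vector size, and the vocabulary size are placeholder values that the posting does not specify. One deliberate deviation is worth naming: PySpark ML's LDA is documented as taking term-count vectors, so this sketch feeds LDA from `CountVectorizer` and keeps the Word2Vec output in a separate column rather than passing Word2Vec vectors (which can be negative) into LDA. Also note that `describeTopics` returns the highest-weighted terms per topic, which is assumed here to be an acceptable reading of "most frequent words".

```python
def build_stages(k=10, vector_size=100, vocab_size=20000):
    """Stages for steps 3-4. All parameter defaults are placeholder assumptions."""
    from pyspark.ml.clustering import LDA
    from pyspark.ml.feature import (
        CountVectorizer, RegexTokenizer, StopWordsRemover, Word2Vec,
    )

    tokenizer = RegexTokenizer(inputCol="all_comments", outputCol="tokens",
                               pattern="\\W+")
    remover = StopWordsRemover(inputCol="tokens", outputCol="terms")
    # Step 3: DataFrame-based Word2Vec (pyspark.ml, not pyspark.mllib).
    word2vec = Word2Vec(inputCol="terms", outputCol="w2v",
                        vectorSize=vector_size, minCount=5)
    # Term counts for LDA (the swapped-in input described in the lead-in).
    counter = CountVectorizer(inputCol="terms", outputCol="counts",
                              vocabSize=vocab_size)
    # Step 4: LDA over the count vectors.
    lda = LDA(k=k, featuresCol="counts", maxIter=20, seed=42)
    return [tokenizer, remover, word2vec, counter, lda]


def terms_for_topics(vocabulary, topic_rows):
    """Step 5: map describeTopics() term indices to words, keyed by topic id."""
    return {
        row["topic"]: [vocabulary[i] for i in row["termIndices"]]
        for row in topic_rows
    }


def fit_and_describe(df, top_n=15):
    """Fit the pipeline on the commentor-level DataFrame and list top terms."""
    from pyspark.ml import Pipeline

    model = Pipeline(stages=build_stages()).fit(df)
    vocabulary = model.stages[3].vocabulary          # CountVectorizerModel
    topic_rows = model.stages[4].describeTopics(top_n).collect()  # LDAModel
    return model, terms_for_topics(vocabulary, topic_rows)
```

Calling `fit_and_describe(df)` on a DataFrame with an `all_comments` column would return the fitted pipeline model and a dictionary of top terms per topic; `model.transform(df)` would then add both the `w2v` and `topicDistribution` columns.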

I estimate this project will take 2 to 3 hours of programming for someone good at Python and PySpark. I hope to have it done in 3 days; up to 6 days is acceptable. If you place a bid, I will share the link to the data file with you. I have no instructions beyond the five steps listed above.

Skills: Data Science, Python, Spark


About the Employer:
(28 reviews) Durham, United States

Project ID: #17903811

11 freelancers are bidding on average $232 for this job


I have a good hands on working with Advanced R and Python and BI tools and technologies, AI, Big Data. I have quite a good knowledge of DL/ML Algorithm, have also developed Dashboards and Web Application. My area of e...

$250 USD in 3 days
(26 Reviews)

Hi I am a very experienced statistician, data scientist and academic writer. I have completed several PhD level thesis projects involving advanced statistical analysis of data. I have worked with data from several comp...

$500 USD in 3 days
(17 Reviews)

I hope to see you in chat. Though I am new to I am an experienced python developer with full-stack knowledge and career. I'm sure I can do this perfectly. Thanks for your kind attention.

$200 USD in 2 days
(27 Reviews)

Hi, dear. nice to meet you. i'm python expert. please discuss more details by chatting. Regards. gao M.

$250 USD in 3 days
(13 Reviews)

Hello! I am a python developer. I looked at your project and it seems interesting. I have all necessary skills required for this project. Ping me to discuss in detail.

$140 USD in 2 days
(22 Reviews)

do kindly let's discuss over chat

$222 USD in 6 days
(22 Reviews)

I have been working as data scientist for more than 4 years during which I implemented numerous machine learning algorithms to solve varied business problems. Moreover, to gain other domain expertise, I have been activ...

$388 USD in 7 days
(2 Reviews)

Hello, Sir. How are you? I have experiences more than 9 years in developing Laravel, node.js, angular.js, react.js and Python Frameworks with mobile apps. I will work for you all my best. Thank you in advances for your t...

$155 USD in 3 days
(1 Review)

Hello? I have read your job description carefully. I have python experienced for 7 years. I want to discuss with you via chat. Thanks you, James.

$155 USD in 3 days
(1 Review)

$244 USD in 21 days
(0 Reviews)

i know pyspark... try me... just need a nice review...

$45 USD in 1 day
(0 Reviews)