
Frontera + Scrapy very large web crawl

$250-750 USD

Closed
Posted almost 4 years ago


Paid on delivery
I need to perform a limited crawl of ~175mm websites. A maximum of four pages will be crawled per website. The crawler will search for three strings (the same three strings) on each page crawled.

Basic crawler logic (a minimal spider sketch follows this description, just before the acceptance criteria):
- Spider the domain specified by the seed list/Frontera
- Search for the three search strings on the main page
- Look for links on the homepage that contain 'about', 'disclosure', or 'disclaimer'
- Spider the links discovered in the prior step
- Search for the three search strings on each of those pages
- Loop through this process until all websites have been crawled

Data structure requirements:
The output data structure (a CSV) will contain a single row for each page that is crawled. The format will be:

domain,full_url,http_response_code,matched_text_string,crawl_date_time
[login to view URL],[login to view URL],200,1,2020-05-09 21:43:08
[login to view URL],[login to view URL],200,0,2020-05-09 21:43:13
[login to view URL],[login to view URL],200,0,2020-05-09 21:43:16
[login to view URL],[login to view URL],200,0,2020-05-09 17:43:08
[login to view URL],[login to view URL],200,0,2020-05-09 17:43:16
[login to view URL],[login to view URL],200,1,2020-05-09 17:43:22

matched_text_string is a boolean indicating whether any of the search strings appeared on the page (1 = a match of any of the three strings; 0 = no string matched at all).

Technical requirements:
- Build this with Scrapy + Frontera. Frontera cluster setup: [login to view URL]. You can probably tailor that walkthrough to this project (a hedged settings sketch follows the acceptance criteria below).
- I prefer the Ubuntu distribution.
- The cluster should crawl about 500,000 sites an hour (so 3-4mm pages/hour). Alexander Sibiryakov (the Frontera core developer) says you can crawl ~1,200 pages/min with one single-threaded spider, and that you need 1 strategy worker and 1 DB worker per four spiders (slide 16 of 21 in this deck: [login to view URL]). This implies a Frontera cluster somewhere in the ballpark of 72 cores to hit my speed requirement.
- Set up a local DNS cache server within the Scrapy + Frontera cluster (Unbound is recommended in the Frontera docs: [login to view URL]).
- The cluster setup script/procedure you deliver will tell me how to double the size of the cluster so I can increase crawl speed to ~1,000,000 sites per hour (e.g., add twice as many resources and run script(s) X, Y, and Z against AWS assets P, Q, and R).

You are building this project for a person who:
- Is not a trained developer
- Has written 100s of screen scrapers
- Has coding experience in many languages, esp. Python, C#, and SQL
- Is very slow to troubleshoot/figure things out in a Linux environment when they are not working
Keep this experience level in mind when thinking about this project.

After I select the project winner:
- I will give you the first ~2m domains that need to be crawled as a sample for tuning/testing the Scrapy + Frontera cluster.
- I will provide the three search strings you will be looking for on the pages that are crawled.
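For reference, here is a minimal sketch of the per-site crawl logic above as a plain Scrapy spider. The search strings, the static start_urls list, and the spider name are placeholders (the real seeds would come from Frontera's seed loading and the real strings will be provided after award); treat it as an illustration of the logic, not the deliverable.

```python
# Sketch only: SEARCH_STRINGS, start_urls and the spider name are placeholders.
from datetime import datetime
from urllib.parse import urlparse

import scrapy

SEARCH_STRINGS = ["string one", "string two", "string three"]  # real strings provided after award
LINK_KEYWORDS = ("about", "disclosure", "disclaimer")


class DisclosureSpider(scrapy.Spider):
    """Crawl the homepage plus up to three matching internal links per domain (max four pages/site)."""
    name = "disclosure_check"

    # In the Frontera setup, seeds come from the seed list/frontier; a static
    # list is used here only to keep the sketch self-contained.
    start_urls = ["https://example.com/"]

    def parse(self, response):
        # Row for the homepage itself.
        yield self.make_row(response)

        # Follow at most three homepage links whose URL or anchor text
        # mentions 'about', 'disclosure', or 'disclaimer'.
        followed = 0
        for link in response.css("a"):
            href = link.attrib.get("href", "")
            if not href:
                continue
            anchor_text = " ".join(link.css("::text").getall())
            haystack = (href + " " + anchor_text).lower()
            if any(kw in haystack for kw in LINK_KEYWORDS):
                followed += 1
                yield response.follow(href, callback=self.parse_subpage)
                if followed >= 3:
                    break

    def parse_subpage(self, response):
        yield self.make_row(response)

    def make_row(self, response):
        """One CSV row: domain,full_url,http_response_code,matched_text_string,crawl_date_time."""
        body = response.text.lower()
        matched = int(any(s.lower() in body for s in SEARCH_STRINGS))
        return {
            "domain": urlparse(response.url).netloc,
            "full_url": response.url,
            "http_response_code": response.status,
            "matched_text_string": matched,  # 1 if any string matched, else 0
            "crawl_date_time": datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S"),
        }
```

Running it locally with something like `scrapy crawl disclosure_check -o sample.csv` is a quick way to confirm the row contents before wiring the spider into the Frontera cluster.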
Acceptance criteria:
- Code for a Scrapy + Frontera project that crawls sites and returns data according to the specifications above.
- A set of instructions and script(s) to set up a distributed Frontera crawling cluster in AWS. The instructions/script(s) are written/structured as a literal walkthrough: the number and size of EC2/other AWS assets to instantiate, the script(s) to run against each AWS asset to set everything up, the command-line commands necessary to kick off the Scrapy + Frontera crawler once the cluster is set up, etc.
- I will (1) use your instructions and scripts to set up the cluster and (2) put the Scrapy/Frontera project into the cluster. Once I am able to set up the cluster with your walkthrough and verify that it (1) crawls fast enough (500k+ sites/hr) and (2) returns the data structure specified above, you are done.
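To make the scaling requirement concrete, below is a hedged sketch of the Scrapy-side settings that hand scheduling to Frontera, plus the handful of Frontera knobs that govern cluster size. The module paths follow the Frontera documentation's Scrapy integration as I understand it; the frontier settings module name, the Kafka address, and the partition counts are placeholders that would come from the cluster-setup walkthrough linked above and the ~72-core sizing.

```python
# settings.py (Scrapy side) -- a sketch based on the Frontera docs' Scrapy
# integration; verify every name against the cluster-setup walkthrough
# linked in the post before relying on it.
SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'

SPIDER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 999,
}
DOWNLOADER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 999,
}
FRONTERA_SETTINGS = 'frontier.settings'   # hypothetical module name

# Broad-crawl friendly Scrapy knobs. The local Unbound DNS cache itself is
# configured at the OS level (resolv.conf pointing at 127.0.0.1); these
# settings only keep Scrapy's side of DNS, retries and timeouts cheap.
CONCURRENT_REQUESTS = 256
REACTOR_THREADPOOL_MAXSIZE = 30
DNS_TIMEOUT = 20
DOWNLOAD_TIMEOUT = 30
RETRY_ENABLED = False
HTTPERROR_ALLOW_ALL = True   # record non-200 response codes instead of dropping them

# frontier/settings.py (Frontera side, spider processes) -- setting names
# recalled from the Frontera distributed-mode docs; partition counts and the
# Kafka address are placeholders to be sized against the 72-core estimate.
BACKEND = 'frontera.contrib.backends.remote.messagebus.MessageBusBackend'
MESSAGE_BUS = 'frontera.contrib.messagebus.kafkabus.MessageBus'
KAFKA_LOCATION = 'localhost:9092'   # placeholder broker address
SPIDER_FEED_PARTITIONS = 16         # roughly one per spider process
SPIDER_LOG_PARTITIONS = 4           # roughly one per strategy worker
MAX_NEXT_REQUESTS = 512
```

Under these assumptions, doubling crawl speed maps to doubling the spider instances and the partition counts and adding strategy/DB workers in the 4:1:1 ratio cited above; the exact procedure would be spelled out in the delivered walkthrough.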
Project ID: 25869915

About the project

9 proposals
Remote project
Active 4 yrs ago

9 freelancers are bidding on average $672 USD for this job
Hi, Frontera + Scrapy is not the best setup for large yet simple scrapes like yours. Would you consider outsourcing this as a service, where I'll crawl everything for you and deliver the final results in a couple of weeks? It saves you AWS costs, etc.
$699 USD in 14 days
5.0 (226 reviews)
8.7
Hi, Thanks for your posting! This is Vasilatos. I have looked through your posting and fully understood your concern. As a senior full-stack developer, I have rich experience with Scrapy-based crawling solutions. I am familiar with the mechanism and with optimized solutions built on the Scrapy framework and its core concepts (pipelines, items, and spiders), and I have deep knowledge of Frontera policies and middlewares. I have also built and delivered a Scrapy-Frontera and Django integrated data mining solution for my clients, supporting data mining and data analysis for establishing the business and marketing strategy of the enterprise. From your posting, I see strong speed, high-scalability, and extensibility requirements; with my software solution architecture design skills, I think I can help you deliver a successful solution. Please contact me; I hope you will share the details via chat. Looking forward to your positive response. Regards
$1,667 USD in 7 days
5.0 (11 reviews)
5.6
Hi, I am interested in this job. I have carefully read the job description and understand the basic requirements. I think I am a perfect fit for this position. Please contact me when you are available; I would like to discuss the project in more detail. Regards, Balint
$500 USD in 7 days
4.9 (8 reviews)
3.8
I have high specialization and experience in the development of software and web platforms. I have the following skills:
Programming languages: C, C++, C#, Matlab, Java, Python, VB
Front-end skills:
- PHP, ASPX/.NET Core, HTML5, JSON
- JavaScript (Ajax, AngularJS, ReactJS, React Native, jQuery)
- CSS3, Bootstrap, Less, Sass, SCSS, responsive and Material design
- Mobile applications: React Native, Swift, Objective-C
Back-end skills:
- Python (Django), RoR, PHP (frameworks: CakePHP, Yii, Laravel, CI), Node.js
- MVC, Postgres, MySQL, REST API
I am a specialized expert with 10 years of experience. Please check my profile. I can also send samples of my work by chat. Can I start right away? Best Regards
$500 USD in 1 day
5.0 (3 reviews)
3.4
Hi, this project caught my interest the moment it was posted. If I am hired for this project, I will make sure to give it my full attention and will deliver it on time. Apart from this, I possess all the skills that you want. I have great experience using Microsoft Excel and web scraping. I will provide you: * High-quality work * Scraping skills * Completion as soon as possible. Please contact me for further information. Thank you. Best Regards From Alexandra
$400 USD in 7 days
5.0 (4 reviews)
3.2
Nice to meet you. I am an Amazon Cloud Architect for a web infrastructure serving 90 million page impressions and 12 TB of Internet traffic per month. The AWS services I use are EC2, ELB, MySQL RDS, VPC, CloudFront, ElastiCache, CloudWatch, CloudFormation, OpsWorks, Elastic Beanstalk, CodeDeploy, S3, SES, SQS and SNS. I have 20 years of Linux sysadmin experience. I currently use Apache, Nginx, Ldirectord, MySQL, Perl, PHP, Memcached, Sphinx, Bind, Typo3, WordPress, Sendmail, Postfix, NFS, Samba, Snort, Vsftpd, AIDE, Nagios, Cacti, Puppet and a bunch of other traditional Linux software. I am good at Linux, Python, and Scrapy. If you're looking for a developer who is truly an expert, driven by passion, not afraid to take on a challenge, and who will be there with you every step of the way, then look no further: I'm your guy.
$637 USD in 9 days
4.7 (3 reviews)
3.1
Hi, Hope you are doing well. I have gone through your requirements and I have the skill set you are looking for. I am a certified Advanced RPA and QA Automation professional. I have 2+ years of experience with Automation Anywhere and have successfully delivered 6 bots which are live and running in production. I also have 10+ years of expertise in test automation framework design with Selenium + Java, UFT + VBScript, and Katalon Studio. I have very good exposure to web, REST, and SOAP service automation, as well as desktop application automation. I am also hands-on with BDD/Cucumber, Elasticsearch, Excel macros, Git, Gradle, CI/CD tools (Jenkins), and cloud testing with Sauce Labs. We can connect to discuss your requirements in more detail and come up with the best solution. Looking forward to working with you. Regards, Anklesh Singh
$649 USD in 7 days
5.0 (2 reviews)
2.7
Hello and thanks for this opportunity. I am Lesia, experienced in data entry, data processing, web scraping, and web crawling, and I'd like to help you with this project. I will scrape all info like name, phone, address, email, website, and more if required. Message me if you're interested. Thanks.
$500 USD in 7 days
0.0 (0 reviews)
0.0

About the client

Austin, United States
5.0
9
Payment method verified
Member since Apr 25, 2017

