[HN Gopher] How to Crawl the Web with Scrapy
___________________________________________________________________
How to Crawl the Web with Scrapy
Author : babblingfish
Score : 98 points
Date : 2021-09-13 18:34 UTC (4 hours ago)
(HTM) web link (www.babbling.fish)
(TXT) w3m dump (www.babbling.fish)
| aynyc wrote:
| I used scrapy a lot. Just my opinion:
|
| 1. Instead of creating a urls global variable, use start_requests
| function.
|
| 2. Don't use beautifulsoup to parse, use CSS or XPATH.
|
| 3. If you are going into multiple pages over and over again, use
| CrawlSpider with Rule.
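|
| A rough sketch of all three points (the site and selectors are
| placeholders):
|
|     import scrapy
|     from scrapy.spiders import CrawlSpider, Rule
|     from scrapy.linkextractors import LinkExtractor
|
|     class BookSpider(scrapy.Spider):
|         name = "books"
|
|         # 1. yield requests lazily, no global urls list
|         def start_requests(self):
|             yield scrapy.Request("http://books.toscrape.com/")
|
|         # 2. built-in CSS selectors instead of BeautifulSoup
|         def parse(self, response):
|             titles = response.css("h3 a::attr(title)").getall()
|             for t in titles:
|                 yield {"title": t}
|
|     # 3. CrawlSpider + Rule for following pages repeatedly
|     class BookFollowSpider(CrawlSpider):
|         name = "books_follow"
|         start_urls = ["http://books.toscrape.com/"]
|         rules = (
|             Rule(LinkExtractor(restrict_css=".pager .next"),
|                  callback="parse_page", follow=True),
|         )
|
|         def parse_page(self, response):
|             titles = response.css("h3 a::attr(title)").getall()
|             for t in titles:
|                 yield {"title": t}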
| sintezcs wrote:
| Can you please give some details about your second point?
| What's wrong with beautifulsoup?
| estebarb wrote:
| It is very slow. But personally, I prefer to write my
| crawlers in Go (custom code, not Colly).
| zatarc wrote:
| Try Parsel: https://github.com/scrapy/parsel
|
| It's way faster and has better support for CSS selectors.
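|
| e.g.:
|
|     from parsel import Selector
|
|     sel = Selector(text="<h1 class='t'>Hi</h1>")
|     sel.css("h1.t::text").get()     # "Hi"
|     sel.xpath("//h1/text()").get()  # "Hi"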
| aynyc wrote:
| Using CSS & XPATH to select elements is very natural to web
| pages. BS4 has very limited CSS selector support and zero
| XPATH support.
| tamaharbor wrote:
| Any suggestions regarding how to scrape Java-based websites? (For
| example, harness racing entries and results from:
| https://racing.ustrotting.com/default.aspx).
| jcun4128 wrote:
| What's wrong with it? That seems like a server-side rendered
| page/easier to deal with than waiting for JS to load.
| artembugara wrote:
| We have to crawl about 60-80k news websites per day [0].
|
| I spent about a month testing whether Scrapy could be a fit for
| our purposes. And, quite surprisingly, it was hard to design a
| distributed web crawler with it. Scrapy is great for those in-
| the-middle tasks where you need to crawl a bit and process data
| on the go.
|
| We ended up just using requests to crawl the web, then post-
| processing the pages in a separate step.
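|
| In sketch form (store and parse are placeholder interfaces):
|
|     import requests
|
|     # stage 1: fetch and store raw HTML, no parsing here
|     def fetch(url, store):
|         store.put(url, requests.get(url, timeout=10).text)
|
|     # stage 2, separate workers: parse whatever was stored
|     def process(store, parse):
|         for url, body in store.items():
|             yield parse(url, body)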
|
| Many thanks to Zyte [1] (ex-ScrapingHub) for open-sourcing so
| many wonderful tools for us. I've spoken to Zyte's CEO, and was
| really fascinated by how he has stayed a dev person while
| running such a big company.
|
| [0] https://newscatcherapi.com/news-api [1] https://www.zyte.com/
| adamqureshi wrote:
| ok so i can just hire zyte to build me a custom scraper?
| artembugara wrote:
| well, I think it is the cheapest & the fastest way, tbh.
| adamqureshi wrote:
| Thank you. The other site is also very interesting. I am
| working on an MVP; it's a news aggregator type site for a
| NICHE product. So I need to aggregate news for a brand from
| maybe 10-20 blogs and list the URLs. Thank you for sharing
| both. I'll reach out to them.
| artembugara wrote:
| I'm the co-founder of the other one, we could help you
| with your task.
|
| Feel free to contact me. artem [at] newscatcherapi.com
| adamqureshi wrote:
| Oh awesome! emailing you now. Thank you.
| jcun4128 wrote:
| > We have to crawl about 60-80k news websites per day [0]
|
| Can't even imagine that number... different languages or
| something?
| artembugara wrote:
| Yeah, plus many of the websites are quite niche news sources
| (construction news, for example)
| no_time wrote:
| While a decent post, this is more or less inadequate in 2021.
| Do a post on bypassing Cloudflare and other anti-bot tech using
| residential proxy swarms.
| matheusmoreira wrote:
| Yeah. I hate Cloudflare and captchas. Why can't these companies
| accept that our scrapers are valid user agents? Only Google is
| allowed to do it, nobody else.
| Eikon wrote:
| Because most scrapers aren't providing any value to website
| owners; in fact, unlike Google, they are costing them money.
| r_singh wrote:
| Exactly!
|
| While there are scraping APIs that unblock requests and charge
| for them, I'd love to learn more about how they work....
| wswope wrote:
| Scraping is a cat and mouse game that'll vary a lot by site.
| I'm far from an expert and welcome correction here, but AFAIK
| the two big tricks that'll go a long way are using a
| residential proxy service (never tried one - they tend to be
| quite shady) and using a webdriver-type setup like Selenium or
| Puppeteer to mock realistic behavior (though IIRC you have to
| obfuscate both of those, since they're detectable via JS).
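|
| E.g. the stock setups leak flags like navigator.webdriver,
| which sites check from JS; a typical counter (Chrome-specific,
| and again, I'm no expert) looks like:
|
|     from selenium import webdriver
|
|     options = webdriver.ChromeOptions()
|     # stop Chrome advertising automation to page scripts
|     options.add_argument(
|         "--disable-blink-features=AutomationControlled")
|     driver = webdriver.Chrome(options=options)
|     # hide navigator.webdriver before any page script runs
|     driver.execute_cdp_cmd(
|         "Page.addScriptToEvaluateOnNewDocument",
|         {"source": "Object.defineProperty(navigator, "
|                    "'webdriver', {get: () => undefined})"})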
| [deleted]
| Eikon wrote:
| They use residential proxies with altered clients and/or
| headless browsers. Cloudflare's bot protection mostly relies on
| TLS fingerprinting, and is thus pretty easy to bypass.
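|
| The fingerprint comes from the TLS ClientHello (cipher order,
| extensions), so even plain Python can present a different one.
| Rough sketch (the cipher string is illustrative):
|
|     import ssl
|     import requests
|     from requests.adapters import HTTPAdapter
|
|     class CustomTLSAdapter(HTTPAdapter):
|         def init_poolmanager(self, *args, **kwargs):
|             ctx = ssl.create_default_context()
|             # a different cipher list changes the ClientHello,
|             # and with it the JA3-style fingerprint
|             ctx.set_ciphers("ECDHE+AESGCM:ECDHE+CHACHA20")
|             kwargs["ssl_context"] = ctx
|             return super().init_poolmanager(*args, **kwargs)
|
|     s = requests.Session()
|     s.mount("https://", CustomTLSAdapter())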
| mmerlin wrote:
| Yes, Scrapy is quite a good scraper technology for some
| features, especially caching, but for some websites it's like
| doing things the hard way...
|
| The easiest scraper with a proxy rotator I've found is in my
| current fave web-automator, scraper scripter and scheduler:
| Rtila [1]
|
| Created by an indie/solo developer-on-fire who cranks out user-
| requested features quite quickly... check the releases page [2]
|
| I have used (or at least trialled) the vast majority of
| scraper tech and written hundreds of scrapers since my first:
| VB5 controlling IE and dumping to SQL Server in the 90's. I
| then moved to various PHP and Python libs/frameworks and a
| handful of Windows apps like ubot and imacros (both of which
| were useful to me at some point, but I never use them nowadays)
|
| A recent release of Rtila allows creating standalone bots you
| can run using its built-in local Node.js server (which also
| has its own locally hosted server API that you can program
| against using any language you like)
|
| [1] https://www.rtila.net
|
| [2] https://github.com/IKAJIAN/rtila-releases/releases
| Lammy wrote:
| I'm sure Rtila is fantastic at what it does, but I gotta say
| it's hilarious to see a landing page done in the Corporate
| Memphis artstyle but worded in euphemism:
| https://www.rtila.net/#h.d30as4n2092u
|
| "'Cause if the web server said no, then the answer obviously
| is no. The thing is that it's not _gonna_ say no--it'd never
| say no, because of the innovation. "
| r_singh wrote:
| I've used Scrapy extensively for writing crawlers.
|
| There are a lot of good things, like not having to worry about
| storage backends, request throttling (random delays between
| requests), and the ability to run parallel parsers easily.
| There is also a lot of open source middleware to help with
| things like retrying requests with proxies and rotating user
| agents.
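|
| The throttling, for reference, is just a couple of lines in
| the project's settings.py (values illustrative):
|
|     DOWNLOAD_DELAY = 2                # base delay per request
|     RANDOMIZE_DOWNLOAD_DELAY = True   # wait 0.5x-1.5x of that
|     AUTOTHROTTLE_ENABLED = True       # adapt to server load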
|
| However, like any batteries-included framework, it has
| downsides in terms of flexibility.
|
| In most cases requests and lxml should be enough to crawl the
| web.
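|
| For example (URL and XPath are placeholders):
|
|     import requests
|     from lxml import html
|
|     resp = requests.get("https://example.com")
|     doc = html.fromstring(resp.text)
|     for href in doc.xpath("//a/@href"):
|         print(href)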
| aynyc wrote:
| If you are just doing one or two pages, say you want to get
| weather for your location, then requests is sufficient. But if
| you want to do many pages where you might want to scan and
| follow, requests gets tedious very quickly.
| r_singh wrote:
| If you're a web developer, not really: rather than worrying
| about storage backends, spiders, yielding, and managing loops
| and items, you could just host a DRF or Flask API with your
| scrapers (written in Requests+lxml) initiated by an API
| request.
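|
| Something like (endpoint and target URL are placeholders):
|
|     import requests
|     from flask import Flask, jsonify
|     from lxml import html
|
|     app = Flask(__name__)
|
|     @app.route("/scrape")
|     def scrape():
|         page = requests.get("https://example.com").text
|         doc = html.fromstring(page)
|         return jsonify(links=doc.xpath("//a/@href"))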
|
| I guess it's a matter of preference
| ducktective wrote:
| > In most cases requests and lxml should be enough to crawl the
| web.
|
| Don't mind my `curl | pup xmlstarlet grep(!!)`s... Nothing to
| see here...
| amozoss wrote:
| My brother-in-law had just finished his pilot training and
| was trying to apply for a job as a teacher to continue his
| training.
|
| However, the jobs were first come, first served, so he was
| waking up at 4 am and refreshing constantly for hours, trying
| to be the first one.
|
| When I heard about it, I quickly whipped up a `curl | grep &&
| send_notif` (used pushback.io for notifs) and it helped him
| not have to worry so much.
|
| When a new job posting finally came along he was the first in
| line and got the job :)
| davidatbu wrote:
| Is the complete example (ie, a git repo or the python file)
| linked anywhere in the blog post?
| babblingfish wrote:
| That's a good idea, I added a link to download a python file
| with all the code at the end of the article.
| question002 wrote:
| Like, who upvotes this? We actually have programming news here
| too. It's just funny that we're supposed to believe stuff like
| Rust is ever going to catch on when 90% of the interest on
| this site is in simple scripting tasks.
| yewenjie wrote:
| Related question - what is a very fast and easy to use library
| for scraping static sites such as Google search results?
| zamadatix wrote:
| Google search isn't a static site, the results are dynamically
| generated based on what it knows about you (location, browser
| language, recent searches from IP, recent searches from
| account, and so on with all of the things they know from trying
| to sell ad slots to that device).
|
| That being said, there isn't anything wrong with using Scrapy
| for this. If you're more familiar with web browsers than
| Python, something like https://github.com/puppeteer/puppeteer
| can also be turned into a quick way to scrape a site by giving
| you a headless browser controlled by whatever you script in
| nodejs.
| yewenjie wrote:
| I see. I am familiar with Python, but I don't need something
| as heavy as Scrapy. Ideally I am looking for something very
| lightweight and fast that can just parse the DOM using CSS
| selectors.
| paulcole wrote:
| I've had excellent luck with SerpAPI. It's $50 a month for
| 5,000 searches, which has been plenty for my needs at a small
| SEO/marketing agency.
|
| http://serpapi.com
| wirthjason wrote:
| I love scrapy! It's a wonderful tool.
|
| One of the most underrated features is the request caching. It
| really helps when your spider crashes, or when you find you
| didn't parse all the data you wanted and have to rerun the
| job. Rather than making hundreds or thousands of requests
| again, you can pull them from the cache.
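|
| Turning it on is a one-liner in settings.py (expiration value
| illustrative):
|
|     HTTPCACHE_ENABLED = True
|     HTTPCACHE_DIR = "httpcache"       # lands under .scrapy/
|     HTTPCACHE_EXPIRATION_SECS = 0     # 0 = cache never expires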
|
| One nitpick is that the documentation could be a bit better about
| integrating scrapy with other Python projects / code rather than
| running it directly from the command line.
|
| Also, some of their internal names are a bit vague. There's a
| Spider and a Crawler. What's the difference? To most people these
| would be the same thing. This makes reading the source code a
| little tricky.
___________________________________________________________________
(page generated 2021-09-13 23:00 UTC)