
Semester 2, 2021
Lecture 4, Part 8: Web Crawling

WWW – A Repository of Data
A large amount of text data is available on the Web.
Web crawling is a method for retrieving that data from the Web.

Web Crawling
• Web crawlers are also known as spiders, robots, and bots.
• Crawlers attempt to visit every page of interest and retrieve it for processing and indexing.
• Basic challenge: there is no central index of URLs of interest.
• Secondary challenges:
  • the same content appearing under a new URL
  • pages that never return status ‘done’ on access
  • websites that are not intended to be crawled
  • content generated on-the-fly from databases → costly for the content provider → excessive visits unwelcome
• Some content has a short lifespan.

Crawling
• The web is a highly linked graph.
• If a web page is of interest, there will be a link to it from another page.
• In principle (see the sketch after this list):
  • Create a prioritised list L of URLs to visit (seed URLs).
  • Create a list V of URLs that have been visited and when.
  • Repeat forever:
    • Choose a URL u from L and fetch the page p(u) at location u.
    • Parse and index p(u). Extract URLs {u′} from p(u).
    • Add u to V and remove it from L. Add {u′} − V to L.
    • Process V to move expired or ‘old’ URLs to L.
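Below is a minimal Python sketch of this loop, assuming the third-party requests and BeautifulSoup libraries. The seed URL, the simple FIFO choice of the next URL, and the max_pages limit are illustrative simplifications rather than the prioritisation a real crawler would use.

```python
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

L = ["https://example.com/"]   # prioritised list of URLs to visit (seed URLs)
V = {}                         # visited URLs -> time of visit

def crawl(max_pages=10):
    pages_fetched = 0
    while L and pages_fetched < max_pages:
        u = L.pop(0)                        # choose a URL u from L (FIFO here)
        if u in V:
            continue
        try:
            resp = requests.get(u, timeout=5)
        except requests.RequestException:
            continue
        V[u] = time.time()                  # add u to V; it is no longer in L
        pages_fetched += 1
        p_u = BeautifulSoup(resp.text, "html.parser")
        # ... parse and index p(u) here ...
        for a in p_u.find_all("a", href=True):   # extract URLs {u'} from p(u)
            u_new = urljoin(u, a["href"])
            if u_new not in V and u_new not in L:
                L.append(u_new)                  # add {u'} - V to L

crawl()
```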

Crawling
• In a web crawl (for the purpose of indexing), every page is visited eventually.
• Synonym URLs are disregarded.
• Significant or dynamic pages are visited more frequently.
• The crawler mustn’t cycle indefinitely in a single web site.
Crawling (a site or the web) vs Scraping
• Crawling starts with seed URLs
• Crawling visits all sites within a linked graph
• Scraping is the process of extracting data
• Scraping is targeted (it is given the information that it has to extract)

Anatomy of a URL: Targeted Scraping
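The slide's figure is not reproduced here; as a substitute, the standard-library urllib.parse call below labels the parts of a URL that targeted scraping typically keys on. The example URL is invented for illustration.

```python
from urllib.parse import urlparse, parse_qs

# Invented URL, used only to label its parts.
url = "https://www.example.com/shop/list?page=2&sort=price#reviews"

parts = urlparse(url)
print(parts.scheme)           # 'https'
print(parts.netloc)           # 'www.example.com'
print(parts.path)             # '/shop/list'
print(parts.query)            # 'page=2&sort=price'
print(parts.fragment)         # 'reviews'
print(parse_qs(parts.query))  # {'page': ['2'], 'sort': ['price']}
```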

Crawling – challenges
• Crawler traps are surprisingly common. For example, a ‘next month’ link on a calendar can potentially be followed until the end of time.
• The Robots Exclusion Standard: protocol that all crawlers are supposed to observe. It allows website managers to restrict access to crawlers while allowing web browsing.
• robots.txt (inclusions, exclusions, sitemaps, etc.)
• See: https://developers.google.com/search/reference/robots.txt
• Ethical and Terms of Use considerations
• Python: scrapy
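A short sketch of honouring the Robots Exclusion Standard with Python's standard urllib.robotparser; the site and the user-agent string below are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Placeholder site and user-agent string.
rp = RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

url = "https://www.example.com/calendar/2021/09"
if rp.can_fetch("MyCrawler/1.0", url):
    print("robots.txt allows fetching", url)
else:
    print("robots.txt disallows", url, "for this user agent")
```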

Parsing
Once a document has been fetched, it must be parsed.
• Web documents can usually be segmented into discrete zones such as title, anchor text, headings, and so on.
• Data can be extracted from specific zones.
• Information such as links and anchors can be analysed, formats such as PDF or Postscript or Word can be translated, and so on.
• Python: BeautifulSoup
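A small BeautifulSoup sketch of this zone-based parsing; the HTML snippet is invented and stands in for a fetched page.

```python
from bs4 import BeautifulSoup

# Invented HTML standing in for a fetched page.
html = """
<html>
  <head><title>Example Department Homepage</title></head>
  <body>
    <h1>Welcome</h1>
    <p>Latest news and <a href="/events.html">upcoming events</a>.</p>
    <h2>Contact</h2>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

print(soup.title.string)                                    # title zone
print([h.get_text() for h in soup.find_all(["h1", "h2"])])  # heading zones
for a in soup.find_all("a", href=True):                     # links and anchor text
    print(a["href"], "->", a.get_text())
```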
