007-003. structure of scrapy, introduction to scraping web
@
Spider class
You designate 'scraping rule' and 'crawling rule' in Spider class
Selector class
You can select specific element from html code
Selector class was built based on lxml(parsing library)
You can use css selector and xpath(recommended)
Item class
This is custom data structure which is used when stroing scraped page
Item Pipeline class
You designate rules how to deal with each item in Item Pipeline class
Item is processed by this rule
Settings class
You can configure detailed settings for Spider and Item Pipeline
You can use robots.txt in Settings class
# @
# activate py27
scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books"
# Open chrome developer tool
# GO to element tab(shift c)
# Copy element as xpath or anything
# On scrapy shell,
response.xpath('//*[@id="site-list-content"]/div[1]/div[3]/a/div')
# Object response stores scraped information
# "http://www.dmoz.org/Computers/Programming/Languages/Python/Books"
# Now, you will get Selector object which contains corresponding element
# @
# XPath concept
# element//*condition
# Find all descendents with all kind(*) of element, from them, select elements which satisfies condition
# element/*condition
# Find direct child with all kind(*) of element, from them of all kind(*), select elements which satisfies condition
# Index in xpath starts with 1
# @
# '//*[@id="site-list-content"]/div[1]/div[3]/a/div/text()'
# element: html
# //: find all descendent with all kind(*) of html
# Use example
response.xpath('//*[@id="site-list-content"]/div[1]/div[3]/a/div')
response.xpath('//*[@id="site-list-content"]/div[1]/div[3]/a/div/text()')
response.xpath('//*[@id="site-list-content"]/div[1]/div[3]/a/div/text()').extract()
# < ['Core Python Programming ']
response.xpath('//*[@id="site-list-content"]/div[1]/div[3]/a/div/text()').extract()[0]
# < 'Core Python Programming '
# @
# We can bring all 'title text' by using for loop
titles = response.xpath('//*[@id="site-list-content"]/div')
title = titles[0]
# title.xpath('./div[3]/a//div/text()')
for title in titles:
print(title.xpath('./div[3]/a//div/text()'))