007-003. Structure of Scrapy, introduction to web scraping

@
Spider class
You define the scraping rules and the crawling rules in the Spider class.

Selector class
You can select a specific element from the HTML code.
The Selector class is built on lxml (a parsing library).
You can use CSS selectors and XPath (recommended).

Item class
This is a custom data structure used to store the data scraped from a page.

Item Pipeline class
You define the rules for handling each item in the Item Pipeline class.
Each item is processed according to these rules.

Settings class
You can configure detailed settings for the Spider and the Item Pipeline.
You can also set whether robots.txt is obeyed (the ROBOTSTXT_OBEY setting).

# @
# activate py27
scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books"

# Open the Chrome developer tools
# Go to the Elements tab (Ctrl+Shift+C)
# Copy the element as XPath (or in any other form)
# In the scrapy shell, run
response.xpath('//*[@id="site-list-content"]/div[1]/div[3]/a/div')
# The response object stores the information scraped from
# "http://www.dmoz.org/Computers/Programming/Languages/Python/Books"
# You now get a Selector object that contains the corresponding element

# @
# XPath concept
# element//*condition
# Find all descendants of element, of any kind (*), and from them select the elements that satisfy condition
# element/*condition
# Find the direct children of element, of any kind (*), and from them select the elements that satisfy condition
# Indexes in XPath start at 1

# @
# '//*[@id="site-list-content"]/div[1]/div[3]/a/div/text()'
# element: html (the expression starts at the document root)
# //*: find all descendants of html, of any kind (*)

# Usage example
response.xpath('//*[@id="site-list-content"]/div[1]/div[3]/a/div')
response.xpath('//*[@id="site-list-content"]/div[1]/div[3]/a/div/text()')
response.xpath('//*[@id="site-list-content"]/div[1]/div[3]/a/div/text()').extract()
# < ['Core Python Programming ']
response.xpath('//*[@id="site-list-content"]/div[1]/div[3]/a/div/text()').extract()[0]
# < 'Core Python Programming '

# @
# We can get the text of every title with a for loop
titles = response.xpath('//*[@id="site-list-content"]/div')
title = titles[0]
# title.xpath('./div[3]/a//div/text()')
for title in titles:
    # extract() turns the selected nodes into a list of strings
    print(title.xpath('./div[3]/a//div/text()').extract())
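
# @
# The XPath rules above (/ for direct children, // for all descendants, indexes starting at 1) can be tried without a live page. The sketch below is only an illustration under assumptions: the HTML in PAGE is a made-up stand-in, not the real dmoz markup, and it uses the standard library's xml.etree.ElementTree instead of Scrapy's Selector. ElementTree supports only a subset of XPath (for example, there is no text() step, so the text is read from the .text attribute).

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the listing page (NOT the real dmoz markup).
# In the second row the title <div> is wrapped in an extra <span>.
PAGE = """
<html>
  <body>
    <div id="site-list-content">
      <div>
        <div>rank</div>
        <div>icon</div>
        <div><a href="/book1"><div>Core Python Programming </div></a></div>
      </div>
      <div>
        <div>rank</div>
        <div>icon</div>
        <div><a href="/book2"><span><div>Dive Into Python </div></span></a></div>
      </div>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())

# //*[condition]: search every descendant of any kind (*) for the matching id.
container = root.find(".//*[@id='site-list-content']")

# /div selects direct children only; //div selects descendants at every depth.
rows = container.findall("./div")
all_divs = container.findall(".//div")
print(len(rows), len(all_divs))  # 2 10

# Indexes start at 1: div[1] is the first row, div[3] its third child div.
first_title = container.find("./div[1]/div[3]/a/div").text
print(repr(first_title))  # 'Core Python Programming '

# a/div needs the div to be a direct child of <a>, so it misses the second row.
strict = [row.find("./div[3]/a/div") is not None for row in rows]
print(strict)  # [True, False]

# a//div accepts any depth below <a>, so the loop finds a title in every row.
titles = []
for row in rows:
    for node in row.findall("./div[3]/a//div"):
        titles.append(node.text)
print(titles)  # ['Core Python Programming ', 'Dive Into Python ']
```

# The second row shows why the loop in these notes uses a//div rather than a/div: the relative path still matches when the title sits deeper inside the link.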