004. using basic of BeautifulSoup @ beauty.py from bs4 import BeautifulSoup # BeautifulSoup analyzes and parses html and xml # Suppose we have following html code html=""" <html> <body> <div id="meigen"> <h1>Wikibooks</h1> <ul class="items art it book"> <li>Unity</li> <li>Swift</li> <li>Website</li> </ul> </div> </body> </html> """ # I create BeautifulSoup instance # html : html code # Kind of parser(this one parses html) soup = BeautifulSoup(html, "html.parser") # below 'ul.items', select all(descendant) 'li' list_items = soup.select("ul.items > li") # soup.select("ul.items li") # soup.select("li") # < <li>Unity</li> # < <li>Swift</li> # < <li>Website</li> # Let's select <h1>Wikibooks</h1> header = soup.select_one("body > div > h1") # Let's use extracted data print(header.string) # < Wikibooks # This one is dictionary type # header.attrs['title'] print(soup.select_one("ul").attrs) # < {'class': ['items', 'art', 'it', 'book']} for li in list_items: print(li.string) # < Unity # < Swift # < Website @ # Why should we extract parts from html by using BeautifulSoup? # It's because all web pages are built by html @ # The most used methods are "select()" and "select_one()" # which are used along with CSS selectors # I select specific tag with specific tag name # If I want to select <ul></ul>, I need to input "ul" # This is tag selector in BeautifulSoup : "ul", "div", "li", ... @ <div id="meigen"> # Id selector : #meigen # class selector : .items # If element has multiple classes like <ul class = "items art it book"> # if you want to select element with multiple clases, you can use .item.it @ <ul class="items art it book"> <li>Unity</li> <li>Swift</li> <li>Website</li> </ul> # "ul.book.items" # < <li>Unity</li> # < <li>Swift</li> # < <li>Website</li> # tag + class selection pattern like "ul.book" is most used # If summany, # 1. tag selector # "h1" # "html" # 1. id selector # "#meigen" # 1. class selector # ".book" @ # Let's talk about structure selector <html> <body> <div id="meigen"> <h1>Wikibooks</h1> <ul class="items art it book"> <li>Unity</li> <li>Swift</li> <li>Website</li> </ul> </div> </body> </html> # 1. descendant selector : all elements under every indent below a specific tag # 1. child selector : all elements under one indent below a specific tag # descendant of <html> is <body> <div id="meigen"> <h1>Wikibooks</h1> <ul class="items"> <li>Unity</li> <li>Swift</li> <li>Website</li> </ul> </div> </body> # child of <html> : <body> <body> </body> @ # descendant selector (all elements under every indent) # "html li" means below html, select its descendant li # <li>Unity</li> # <li>Swift</li> # <li>Website</li> # "#meigen li" : below #meigen, select its descendant li # <li>Unity</li> # <li>Swift</li> # <li>Website</li> @ # child selector (all elements under one indent) # "ul.items > li" : below ul.items, select its child li # <li>Unity</li> # <li>Swift</li> # <li>Website</li>