
007. install phantomjs and selenium
# @
# We will learn how to scrape web data by driving a real web browser
# @
# So far, scraping has meant imitating the request pattern and sending the imitated request
# We then fetch the HTML and analyze, parse, process, and extract the web data from it
# @
# But on some sites (Facebook, for example) imitating the request is not enough to get HTML that holds the content
# @
# On a Facebook post, press Ctrl+Shift+I to inspect the page
# You can see the various elements of the post there
# However, those elements are missing from the raw HTML the server sends (view page source, Ctrl+U in most browsers)
# The reason is that the page's HTML does not contain the content itself
# The HTML first renders a skeleton ("skin") page, and then JavaScript fetches the content and inserts it
# Therefore, the content can be scraped only after the JavaScript has run
# @
# Selenium is used to control a web browser remotely
# It is available as a Python module
# and can drive PhantomJS (a web browser without a screen), Firefox, and Chrome
# In other words, PhantomJS loads and renders the web page,
# and Selenium hands that rendered page back to our Python code
# @
docker pull ubuntu:16.04
docker run -it ubuntu:16.04
apt-get update
# Install python3 and python3-pip
apt-get install -y python3 python3-pip
pip3 install selenium
pip3 install beautifulsoup4
apt-get install -y wget libfontconfig
# Create the /home/root/src folder and move into it ($_ expands to the last argument of the previous command)
mkdir -p /home/root/src && cd $_
# Download PhantomJS
wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2
# Extract the archive
tar jxvf phantomjs-2.1.1-linux-x86_64.tar.bz2
# Move into the extracted bin folder
cd phantomjs-2.1.1-linux-x86_64/bin/
# Copy the phantomjs binary into /usr/local/bin/
cp phantomjs /usr/local/bin/
# Install the Nanum fonts
apt-get install -y fonts-nanum*
# On the host, list the containers to find the ID of the one set up above
docker ps -a
# Save that container as an image (docker commit takes a container ID)
docker commit containerid ubuntu-phantomjs
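Once the image is built, a quick way to check the setup is to load a page through PhantomJS and read back the HTML after JavaScript has run. The sketch below is only an illustration under assumptions: it assumes Selenium 3.x (`webdriver.PhantomJS` was removed in Selenium 4) and that `phantomjs` is on the PATH, as it is after the copy to /usr/local/bin/. The helper names `make_phantomjs_driver` and `fetch_rendered_html` and the `settle_seconds` parameter are hypothetical, not part of Selenium.

```python
import time


def make_phantomjs_driver():
    """Start PhantomJS through Selenium (assumes the Selenium 3.x API)."""
    # Imported lazily so fetch_rendered_html has no hard dependency on Selenium.
    from selenium import webdriver
    return webdriver.PhantomJS()  # looks for the `phantomjs` binary on PATH


def fetch_rendered_html(driver, url, settle_seconds=0.0):
    """Load `url` in the browser and return the HTML of the rendered page."""
    driver.get(url)  # returns once the page's load event has fired
    if settle_seconds > 0:
        # Crude pause for content that JavaScript fetches after the load event.
        time.sleep(settle_seconds)
    # page_source is the browser's current DOM serialized as HTML,
    # not the raw response body that an imitated request would receive.
    return driver.page_source


# Usage (needs the phantomjs binary and Selenium 3.x installed):
#   driver = make_phantomjs_driver()
#   try:
#       html = fetch_rendered_html(driver, "https://example.com", settle_seconds=2)
#   finally:
#       driver.quit()
```

The returned string can then be handed to BeautifulSoup for parsing, in the same way as HTML obtained from an imitated request. Passing the driver in as an argument keeps the helper independent of which browser Selenium is driving.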