Python Web scraping using Python Selenium

Definition:


Before looking at web scraping in Python, let's first look at what web scraping is. In simple terms, web scraping is the automated extraction of data from a website. A scraper processes the HTML of a web page so that the content can be converted into a different format, copied into a database, or used in other ways. A scraper can fetch pages directly over the HyperText Transfer Protocol (HTTP) or by driving a web browser; the same work can also be done manually. In this tutorial, we use Selenium for Python as the scraping tool. Web scraping is sometimes done for illegitimate purposes too: if a business depends on its content, having that content scraped and reused can cause it a huge financial loss.


Techniques of Web Scraping:


There are many techniques for web scraping. Some of them are listed below:

  • Human Copy Paste: The simplest method of web scraping is to manually copy data from a web page and paste it into a spreadsheet. Sometimes this is also the best method, because some websites have strong protection against automation.
  • Text Pattern Matching: A simple yet powerful method based on the UNIX grep command or on the regular expression (RegEx) facilities of programming languages such as Perl and Python.
  • HTTP Programming: Static and dynamic web pages can be retrieved by sending HTTP requests to a remote web server, for example using socket programming.
  • HTML Parsing: Many websites generate their pages dynamically from a source such as a database, so pages of the same kind share a common template. A program that detects such a template and extracts its content is called a wrapper. This technique is mainly used in data mining.
  • DOM Parsing: By embedding a full-fledged web browser, such as Internet Explorer or Mozilla Firefox, a program can retrieve content generated by client-side scripts. The browser parses the page into a DOM tree, from which parts of the page can be extracted, for example with XPath.
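
To make two of the techniques above concrete, here is a minimal sketch that uses only the Python standard library. It extracts links from a small page, first with text pattern matching (the re module) and then with HTML parsing (the html.parser module). The page content and the LinkCollector class are made up for this illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical page content, used only for this illustration.
SAMPLE_HTML = """
<html>
 <body>
  <p>Are you sure you want to do this?</p>
  <a href="next.html">Next</a>
  <a href="previous.html">Back</a>
 </body>
</html>
"""

# Text pattern matching: a regular expression pulls out every href value.
links_by_regex = re.findall(r'href="([^"]+)"', SAMPLE_HTML)


class LinkCollector(HTMLParser):
    """HTML parsing: collect (href, link text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        text = data.strip()
        if self._href is not None and text:
            self.links.append((self._href, text))

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None


collector = LinkCollector()
collector.feed(SAMPLE_HTML)

print(links_by_regex)
print(collector.links)
```

The regular expression is shorter, but it only sees text. The parser knows which tag each piece of text belongs to, which is why it can pair every link with its label.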

Some uses of Web Scraping:


Web scraping also has many uses. Some of them are listed below:

  • Web Data Integrating
  • Data Mining
  • Detecting change in a website
  • Online presence checking
  • Weather Data Monitoring
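
As one illustration of the "detecting change in a website" use above, the sketch below compares two snapshots of a page by hashing their text. It is a simplified approach under stated assumptions: the function names and the sample snapshots are made up, and a real monitor would first download the page source.

```python
import hashlib


def fingerprint(page_source: str) -> str:
    """Return a SHA-256 digest of the page text, ignoring whitespace differences."""
    normalized = " ".join(page_source.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(old_source: str, new_source: str) -> bool:
    """True when two snapshots of a page differ in content."""
    return fingerprint(old_source) != fingerprint(new_source)


# Hypothetical snapshots of the same page taken at different times.
yesterday = "<p class='cont'>Lorem ipsum.</p>"
today_same = "<p class='cont'>Lorem   ipsum.</p>\n"   # only whitespace differs
today_new = "<p class='cont'>Dolor sit amet.</p>"     # content differs

print(has_changed(yesterday, today_same))
print(has_changed(yesterday, today_new))
```

Storing only the digest, rather than the whole page, keeps the comparison cheap when many pages are monitored.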

Selenium, the tool used in this tutorial, can be used in the following six types of testing:


  • Acceptance testing
  • Functional testing
  • Performance testing
  • Regression testing
  • Test-driven development (TDD)
  • Behavior-driven development (BDD)


Languages that can be used with Selenium:


  • Java
  • Python
  • C#
  • Ruby
  • JavaScript
  • Kotlin

Introduction to Selenium


Selenium is a tool that automates web browsers. It is mainly used for testing web applications, but it can do much more. It opens a web browser and does the things a human user would do, such as clicking buttons, searching for information, and filling in input fields. Although it is a handy tool, use it responsibly: if you scrape a website too frequently or for malicious purposes, your IP address may be banned, because automated scraping is against the terms of service of most websites.
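
One common courtesy before scraping is to check the site's robots.txt file, a convention by which a site states which paths crawlers may fetch. It is not the same thing as the terms of service, but it is a reasonable first check. The sketch below uses the standard library's urllib.robotparser. The robots.txt content, the user agent name, and the URLs are made up for this illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real scraper would download the file
# from the site, for example with RobotFileParser.set_url() and read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

USER_AGENT = "my-scraper"  # hypothetical user agent name

# May this user agent fetch these URLs?
print(parser.can_fetch(USER_AGENT, "https://www.example.com/articles/1"))
print(parser.can_fetch(USER_AGENT, "https://www.example.com/private/data"))

# Number of seconds the site asks crawlers to wait between requests.
print(parser.crawl_delay(USER_AGENT))
```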


A. Installation


Requirements:

Selenium 3 requires Python 2.7.9 or higher, or Python 3.4.0 or higher. Selenium 4 supports Python 3 only. pip should also be installed.

Run the following command in a terminal.

pip install selenium

Drivers:


Selenium needs a browser-specific driver to interface with each browser. Some of the most popular browsers and their drivers are given below.

Browser    Driver
Chrome     Chrome Driver
Edge       Edge Driver
Firefox    Gecko Driver
Opera      Opera Driver
Safari     Safari WebDriver (built in)

B. Getting Started


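Simple Use:

Once the web driver and the selenium package are installed, you can start using Selenium with Python as shown below:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Driver specification and executable path (Selenium 4 style)
driver = webdriver.Chrome(service=Service(executable_path="PATH TO WEBDRIVER"))
# Opens the specified website; the URL must include the scheme
driver.get("https://copyassignment.com/")
# Locates the search input element on the page
search_element = driver.find_element(By.ID, "is-search-input-0")
text = "Turtle"
# Sends the text to the input field, then presses Return to submit it
search_element.send_keys(text)
search_element.send_keys(Keys.RETURN)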

The above code opens https://copyassignment.com/ and enters Turtle in the search box. To search for something else, change the value of the “text” variable.


Writing tests with selenium:

As we know, Selenium is mostly used for writing test cases. However, Selenium itself does not come with a testing framework, so tests are written with Python modules such as unittest, pytest, or nose.

Here we will use the built-in unittest module. The following code checks that the search on python.org works.

import unittest
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

class PythonOrgSearch(unittest.TestCase):

    def setUp(self):
        # Start a fresh browser session for each test
        self.driver = webdriver.Firefox()

    def test_search_in_python_org(self):
        driver = self.driver
        driver.get("http://www.python.org")
        self.assertIn("Python", driver.title)
        # Locate the search box by its name attribute
        elem = driver.find_element(By.NAME, "q")
        elem.send_keys("pycon")
        elem.send_keys(Keys.RETURN)
        self.assertNotIn("No results found.", driver.page_source)

    def tearDown(self):
        # quit() ends the session and closes every window it opened
        self.driver.quit()

if __name__ == "__main__":
    unittest.main()

Save the above code as test_python_org_search.py and run it from a shell. The output will look like this:

python test_python_org_search.py
.
----------------------------------------------------------------------
Ran 1 test in 15.566s

OK

The output shows that the test passed. You can also run this code in Jupyter or IPython.



Selenium Remote WebDriver for Python Selenium:

To use the remote WebDriver, you need a running Selenium server. Once the server is running, you can connect to it as in the following examples.

from selenium import webdriver

# Selenium 4 style: pass a browser options object.
# (Selenium 3 used desired_capabilities=DesiredCapabilities.CHROME instead.)
driver = webdriver.Remote(
   command_executor='http://127.0.0.1:4444/wd/hub',
   options=webdriver.ChromeOptions())

driver = webdriver.Remote(
   command_executor='http://127.0.0.1:4444/wd/hub',
   options=webdriver.FirefoxOptions())

driver = webdriver.Remote(
   command_executor='http://127.0.0.1:4444/wd/hub',
   options=webdriver.EdgeOptions())

Capabilities are ultimately a dictionary of names and values, so you can also set individual values by name:

options = webdriver.ChromeOptions()
# Each capability is a name/value pair; platformName is a standard W3C capability
options.set_capability('platformName', 'linux')
driver = webdriver.Remote(
   command_executor='http://127.0.0.1:4444/wd/hub',
   options=options)

C. Navigation

Interaction with HTML:

<input type="text" name="inp" id="fill" class="sam">

To locate the above HTML element, you can use any of the following.

from selenium.webdriver.common.by import By

scr = driver.find_element(By.NAME, "inp")
scr = driver.find_element(By.ID, "fill")
scr = driver.find_element(By.CLASS_NAME, "sam")
scr = driver.find_element(By.XPATH, "//input[@id='fill']")
scr = driver.find_element(By.CSS_SELECTOR, "input#fill")

NOTE: The ID of an element should be unique, meaning that no other element on the page has the same ID. If an ID is duplicated, Selenium returns the first matching element.

Sending Text to input field:

To send text to the input field, use the send_keys method.

scr.send_keys("Sample Text")

Special keys, such as Return or the arrow keys, are sent with the Keys class, which needs one more import.

from selenium.webdriver.common.keys import Keys

The table below lists the key names available on the Keys class. For example, scr.send_keys(Keys.RETURN) presses the Return key.

Key Codes:

ADD ALT ARROW_DOWN
ARROW_LEFT ARROW_RIGHT ARROW_UP
BACKSPACE BACK_SPACE CANCEL
CLEAR COMMAND CONTROL
DECIMAL DELETE DIVIDE
DOWN END ENTER
EQUALS ESCAPE F1
F10 F11 F12
F2 F3 F4
F5 F6 F7
F8 F9 HELP
HOME INSERT LEFT
LEFT_ALT LEFT_CONTROL LEFT_SHIFT
META MULTIPLY NULL
NUMPAD0 NUMPAD1 NUMPAD2
NUMPAD3 NUMPAD4 NUMPAD5
NUMPAD6 NUMPAD7 NUMPAD8
NUMPAD9 PAGE_DOWN PAGE_UP
PAUSE RETURN RIGHT
SEMICOLON SEPARATOR SHIFT
SPACE SUBTRACT TAB


Drag and Drop:

With drag and drop, you can either move an element by a certain amount or drop it onto another element. The following code drags one element onto another.

from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By

element = driver.find_element(By.NAME, "source")
target = driver.find_element(By.NAME, "target")

action_chains = ActionChains(driver)
action_chains.drag_and_drop(element, target).perform()

Moving between windows and frames:

Nowadays, many websites use more than one window or frame, so WebDriver provides methods to move between them. To switch to a named window, use “switch_to.window” (older Selenium versions spelled it “switch_to_window”). For example, let the HTML code be

<a href="somewhere.html" target="windowName">Click here to open a new window</a>

Then, the Python code will be

driver.switch_to.window("windowName")

Frames are entered in the same way with driver.switch_to.frame.
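
When the new window has no name, a common approach is to compare the driver's window handles before and after the click. The helper below is a minimal sketch of that idea. The function name is made up for this illustration, and it only relies on two real driver members, window_handles and switch_to.window.

```python
def switch_to_new_window(driver, known_handles):
    """Switch to the first window handle that is not in known_handles.

    Returns the new handle, or None if no new window has been opened.
    """
    for handle in driver.window_handles:
        if handle not in known_handles:
            driver.switch_to.window(handle)
            return handle
    return None


# Typical use with a real driver (not executed here):
#   before = set(driver.window_handles)
#   link.click()                      # opens the new window
#   switch_to_new_window(driver, before)
```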

Popups:

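Selenium has built-in support for handling popup dialog boxes. After you trigger an action that opens a popup, you can access it as follows:

alert = driver.switch_to.alert

The returned object lets you accept or dismiss the alert and read its text.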

Cookies:

Selenium can also read and set cookies. A cookie can only be added for the domain of the page that is currently loaded, so navigate to that domain first.

# Navigating to the domain
driver.get("http://www.example.com")

# Setting the cookie
cookie = {'name': 'aakriti', 'value': 'girl'}
driver.add_cookie(cookie)

# This returns all the cookies visible to the current page
print(driver.get_cookies())
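
Because get_cookies returns a list of plain dictionaries, cookies can be saved to a file and restored in a later session. The two helpers below are a minimal sketch of that idea. Their names and the file path are assumptions made for this illustration, and the driver must already be on a page of the cookies' domain before load_cookies is called.

```python
import json


def save_cookies(driver, path):
    """Write the cookies of the current session to a JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(driver.get_cookies(), fh)


def load_cookies(driver, path):
    """Add every cookie stored in the JSON file to the current session."""
    with open(path, encoding="utf-8") as fh:
        for cookie in json.load(fh):
            driver.add_cookie(cookie)


# Typical use with a real driver (not executed here):
#   save_cookies(driver, "cookies.json")
#   ... later, in a new session on the same domain ...
#   load_cookies(driver, "cookies.json")
```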

Fun Fact: We know that Selenium with Python is a popular way of testing web apps, but Selenium supports many other languages too. Some of them are C#, Java, Ruby, and JavaScript.

D. Locating

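There are many ways to locate elements on a web page with Selenium. Choose whichever suits the page you are working with. Selenium 3 provides the following helper methods for locating a single element. (These helpers were removed in Selenium 4.3; in newer versions use find_element with a By strategy, described further below.)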

  • find_element_by_id
  • find_element_by_name
  • find_element_by_xpath
  • find_element_by_link_text
  • find_element_by_partial_link_text
  • find_element_by_tag_name
  • find_element_by_class_name
  • find_element_by_css_selector

To find all the matching elements on a page, use the following methods.

  • find_elements_by_name
  • find_elements_by_xpath
  • find_elements_by_link_text
  • find_elements_by_partial_link_text
  • find_elements_by_tag_name
  • find_elements_by_class_name
  • find_elements_by_css_selector

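Note: The find_elements methods above return a list.

All of these helpers are shortcuts for two general methods, find_element and find_elements, which take a locator strategy from the By class and a value. The general methods work in both Selenium 3 and Selenium 4 and are used in the example below.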

from selenium.webdriver.common.by import By

driver.find_element(By.XPATH, '//button[text()="Some text"]')
driver.find_elements(By.XPATH, '//input')

The By class has the following attributes.

ID = "id"
XPATH = "xpath"
LINK_TEXT = "link text"
PARTIAL_LINK_TEXT = "partial link text"
NAME = "name"
TAG_NAME = "tag name"
CLASS_NAME = "class name"
CSS_SELECTOR = "css selector"
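
Since the By attributes are plain strings, a locator is just a (strategy, value) tuple that can be stored and unpacked into find_element or find_elements. The sketch below uses that to try several locators in turn. The helper name and the sample locators are made up for this illustration.

```python
# Locators are plain (strategy, value) tuples. The strategy strings are the
# values of the By attributes listed above, so ("id", "loginForm") is the
# same as (By.ID, "loginForm").
LOGIN_FORM = ("id", "loginForm")
USERNAME_CANDIDATES = [
    ("name", "username"),
    ("css selector", "input[type='text']"),
]


def find_first(driver, locators):
    """Return the first element matched by any of the locators, or None.

    find_elements returns an empty list instead of raising an exception
    when nothing matches, which makes it convenient for fallbacks.
    """
    for locator in locators:
        matches = driver.find_elements(*locator)
        if matches:
            return matches[0]
    return None


# Typical use with a real driver (not executed here):
#   username = find_first(driver, USERNAME_CANDIDATES)
```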

1. Locating an element by ID

Use this strategy when you know the id attribute of an element. Selenium returns the first element with a matching id. If no element has a matching id, a NoSuchElementException is raised.

To illustrate, let the source of the page be:

<html>
 <body>
  <form id="loginForm">
   <input name="username" type="text">
   <input name="password" type="password">
   <input name="continue" type="submit" value="Login">
  </form>
 </body>
</html>

Here, the form element can be located like this:

form = driver.find_element(By.ID, 'loginForm')

2. Locating an element by Name

Use this strategy when you know the name attribute of an element. Selenium returns the first element with a matching name. If no element has a matching name, a NoSuchElementException is raised.

For instance, let the source of a page be:

<html>
 <body>
  <form id="loginForm">
   <input name="username" type="text">
   <input name="password" type="password">
   <input name="continue" type="submit" value="Login">
   <input name="continue" type="button" value="Clear">
  </form>
 </body>
</html>

Therefore, the Python code to locate the username and password elements will be as follows.

username = driver.find_element(By.NAME, 'username')
password = driver.find_element(By.NAME, 'password')

Moreover, to locate the Login button, we use the code

continue_button = driver.find_element(By.NAME, 'continue')

Note: Two elements are named "continue". Selenium returns the Login button because it occurs before the Clear button. The variable itself cannot be called continue, because that is a reserved word in Python.

3. Locating with XPath

XPath is a language that is used to locate nodes in XML documents. XPath supports the simple locating methods by id or name attributes and extends them by opening up all sorts of new possibilities, such as locating the third checkbox on the page.

One of the main reasons for using XPath is when you don’t have a suitable id or name attribute for the element you wish to locate.

You can use XPath to either locate the element in absolute terms or relative to an element that does have an id or name attribute.

XPath locators can also be used to specify elements via attributes other than id and name.

By finding a nearby element with an id or name attribute, you can locate your target element based on the relationship.

Let the source of a page be:

<html>
 <body>
  <form id="loginForm">
   <input name="username" type="text" />
   <input name="password" type="password" />
   <input name="continue" type="submit" value="Login" />
   <input name="continue" type="button" value="Clear" />
  </form>
 </body>
</html>

So, the form element can be located with any of the following:

login_form = driver.find_element(By.XPATH, "/html/body/form[1]")
login_form = driver.find_element(By.XPATH, "//form[1]")
login_form = driver.find_element(By.XPATH, "//form[@id='loginForm']")

Similarly, the username element can be located with any of the following:

# First input child of a form element
username = driver.find_element(By.XPATH, "//form/input[1]")
username = driver.find_element(By.XPATH, "//form[@id='loginForm']/input[1]")
username = driver.find_element(By.XPATH, "//input[@name='username']")

Likewise, the Clear button can be located as follows.

clear_button = driver.find_element(By.XPATH, "//input[@name='continue'][@type='button']")
clear_button = driver.find_element(By.XPATH, "//form[@id='loginForm']/input[4]")
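
If you want to try such expressions without starting a browser, the standard library's xml.etree.ElementTree understands a limited subset of XPath. The sketch below runs a few of the expressions on the login page from above. Two assumptions apply: the page is written as well-formed XML, and each path starts with "." because ElementTree searches relative to the root element, whereas Selenium hands the full expression to the browser.

```python
import xml.etree.ElementTree as ET

# The login page from above, written as well-formed XML.
PAGE = """
<html>
 <body>
  <form id="loginForm">
   <input name="username" type="text" />
   <input name="password" type="password" />
   <input name="continue" type="submit" value="Login" />
   <input name="continue" type="button" value="Clear" />
  </form>
 </body>
</html>
"""

root = ET.fromstring(PAGE.strip())

# Locate by attribute value
form = root.find(".//form[@id='loginForm']")
username = root.find(".//input[@name='username']")

# Locate by position relative to an element that has an id
first_input = root.find(".//form[@id='loginForm']/input[1]")
fourth_input = root.find(".//form[@id='loginForm']/input[4]")

# Combine two attribute tests to tell the two "continue" inputs apart
clear_button = root.find(".//input[@name='continue'][@type='button']")

print(form.get("id"))
print(username.get("type"), first_input.get("name"))
print(clear_button.get("value"), fourth_input.get("value"))
```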


4. Locating Hyperlinks by Link Text

Use this strategy when you know the link text used inside an anchor tag. Selenium returns the first element whose link text matches the given value. If no element matches, a NoSuchElementException is raised.

For instance, let the source of the page be like this.

<html>
 <body>
  <p>Are you sure you want to do this?</p>
  <a href="next.html">Next</a>
  <a href="previous.html">Back</a>
 </body>
</html>

The links to next.html and previous.html can then be located as follows.

continue_link = driver.find_element(By.LINK_TEXT, 'Next')
back_link = driver.find_element(By.PARTIAL_LINK_TEXT, 'Back')

5. Locating Elements by Class Name

Use this strategy when you want to locate an element by its class attribute. Selenium returns the first element with a matching class name. If no element matches, a NoSuchElementException is raised.

For example, let the source be:

<html>
 <body>
  <p class="cont">lorem ipsum.</p>
 </body>
</html>

Here, the “p” element can be located as

loc = driver.find_element(By.CLASS_NAME, 'cont')

Locating by tag name works the same way: driver.find_element(By.TAG_NAME, 'p') returns the first “p” element on the page.

6. Locating Elements by CSS Selector

Use this strategy when you want to locate an element with CSS selector syntax. Selenium returns the first element that matches the selector. If no element matches, a NoSuchElementException is raised.

To illustrate, let the HTML code be:

<html>
 <body>
  <p class="cont">Lorem ipsum.</p>
 </body>
</html>

The code to locate the “p” tag will be:

content = driver.find_element(By.CSS_SELECTOR, 'p.cont')

E. Wait

Nowadays, many websites use AJAX techniques, so the elements of a page may load at different times. This makes locating elements difficult: if a locate function runs before the element is present in the DOM, it raises a NoSuchElementException. Selenium solves this problem with a feature called waits, which add some slack between locating an element and performing actions on it.

Selenium WebDriver provides two types of waits: explicit waits and implicit waits. They are explained below.

1. Explicit Waits

An explicit wait makes the program wait until a specified condition is fulfilled before it continues. The extreme case is time.sleep() from Python’s built-in “time” module, which always waits for exactly the given period, whether or not the page is ready. A better way is to use WebDriverWait together with an expected condition, which waits only as long as needed.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("http://somedomain/url_that_delays_loading")
try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement"))
    )
finally:
    driver.quit()

The above code waits for up to 10 seconds for the matching element to be found. By default, WebDriverWait checks the condition every 500 milliseconds. If the condition succeeds within the time limit, until returns the condition's result, which here is the located element. If no element is found in time, a TimeoutException is raised.
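
The polling behaviour can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not Selenium's implementation: the real WebDriverWait also passes the driver to the condition and, by default, ignores NoSuchElementException between polls. The function and the sample condition are made up for this illustration.

```python
import time


def wait_until(condition, timeout=10.0, poll_frequency=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(poll_frequency)


# Hypothetical condition that only succeeds on the third check,
# standing in for an element that appears after a delay.
attempts = {"count": 0}


def element_appears():
    attempts["count"] += 1
    return "element" if attempts["count"] >= 3 else False


print(wait_until(element_appears, timeout=2.0, poll_frequency=0.01))
print(attempts["count"])
```

Unlike a fixed time.sleep(), the loop returns as soon as the condition holds, so a fast page costs almost no waiting time.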

A. Expected Conditions

Some conditions come up again and again when automating a browser. Selenium provides ready-made versions of them in the expected_conditions module. Some of them are listed below.

  • title_is
  • title_contains
  • presence_of_element_located
  • visibility_of_element_located
  • visibility_of
  • presence_of_all_elements_located
  • text_to_be_present_in_element
  • text_to_be_present_in_element_value
  • frame_to_be_available_and_switch_to_it
  • invisibility_of_element_located
  • element_to_be_clickable
  • staleness_of
  • element_to_be_selected
  • element_located_to_be_selected
  • element_selection_state_to_be
  • element_located_selection_state_to_be
  • alert_is_present
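
For example, to wait until an element is clickable:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)
element = wait.until(EC.element_to_be_clickable((By.ID, 'someid')))

B. Custom Wait Conditions

You can also create a custom wait condition when none of the conditions above fits. A custom condition is a class with a __call__ method that returns False while the condition is not met.
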
class element_has_css_class(object):
    """An expectation for checking that an element has a particular css class.

    locator - used to find the element
    returns the WebElement once it has the particular css class
    """
    def __init__(self, locator, css_class):
        self.locator = locator
        self.css_class = css_class

    def __call__(self, driver):
        element = driver.find_element(*self.locator)   # Finding the referenced element
        # Compare whole class names, so "btn" does not match "btn-large"
        classes = (element.get_attribute("class") or "").split()
        if self.css_class in classes:
            return element
        else:
            return False

# Wait until an element with id='myNewInput' has class 'myCSSClass'
wait = WebDriverWait(driver, 10)
element = wait.until(element_has_css_class((By.ID, 'myNewInput'), "myCSSClass"))
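
The same pattern works for other conditions. As a further sketch, the class below waits until at least a given number of elements match a locator, which can be useful for lists or tables that fill in gradually. The class name is made up for this illustration and is not part of Selenium.

```python
class element_count_at_least(object):
    """An expectation that at least `count` elements match the locator.

    Returns the list of matching elements once there are enough of them,
    and False until then.
    """

    def __init__(self, locator, count):
        self.locator = locator
        self.count = count

    def __call__(self, driver):
        elements = driver.find_elements(*self.locator)
        if len(elements) >= self.count:
            return elements
        return False


# Typical use with a real driver (not executed here):
#   wait = WebDriverWait(driver, 10)
#   rows = wait.until(element_count_at_least((By.TAG_NAME, 'tr'), 5))
```
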

2. Implicit Waits

An implicit wait tells WebDriver to poll the DOM for a certain amount of time when it tries to find an element that is not immediately available. The default setting is 0, which means no waiting. Once set, the implicit wait applies for the life of the WebDriver object.

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.implicitly_wait(10) # seconds
driver.get("http://somedomain/url_that_delays_loading")
myDynamicElement = driver.find_element(By.ID, "myDynamicElement")

F. Page Objects

Finally, we will learn about the page object design pattern. A page object represents an area of the web application's user interface that your test interacts with.

Some benefits of using the page object pattern are:

  • Code can be written once and reused in many test cases.
  • The amount of duplicated code is reduced.
  • If the user interface changes, the fix is needed in only one place.

1. Test Case

Here is a test case that searches for a word on python.org and checks that some results are returned.

import unittest
from selenium import webdriver
import page

class PythonOrgSearch(unittest.TestCase):

    def setUp(self):
        self.driver = webdriver.Firefox()
        self.driver.get("http://www.python.org")

    def test_search_in_python_org(self):
        """
        Tests the python.org search feature. Searches for the word "pycon" then verifies that some results show up.
        Note that it does not look for any particular text in the search results page. This test verifies that
        the results were not empty.
        """

        #Load the main page. In this case the home page of python.org.
        main_page = page.MainPage(self.driver)
        #Checks if the word "Python" is in title
        assert main_page.is_title_matches(), "python.org title doesn't match."
        #Sets the text of search textbox to "pycon"
        main_page.search_text_element = "pycon"
        main_page.click_go_button()
        search_results_page = page.SearchResultsPage(self.driver)
        #Verifies that the results page is not empty
        assert search_results_page.is_results_found(), "No results found."

    def tearDown(self):
        self.driver.close()

if __name__ == "__main__":
    unittest.main()

2. Page Object Classes

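The page object pattern creates a separation between the test code and the technical implementation. The test case above imports the following classes from a module called page (page.py).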

from element import BasePageElement
from locators import MainPageLocators

class SearchTextElement(BasePageElement):
    """This class gets the search text from the specified locator"""

    #The locator for search box where search string is entered
    locator = 'q'


class BasePage(object):
    """Base class to initialize the base page that will be called from all pages"""

    def __init__(self, driver):
        self.driver = driver


class MainPage(BasePage):
    """Home page action methods come here. I.e. Python.org"""

    #Declares a variable that will contain the retrieved text
    search_text_element = SearchTextElement()

    def is_title_matches(self):
        """Verifies that the hardcoded text "Python" appears in page title"""
        return "Python" in self.driver.title

    def click_go_button(self):
        """Triggers the search"""
        element = self.driver.find_element(*MainPageLocators.GO_BUTTON)
        element.click()


class SearchResultsPage(BasePage):
    """Search results page action methods come here"""

    def is_results_found(self):
        #Probably should search for this text in the specific page
        # element, but as for now it works fine
        return "No results found." not in self.driver.page_source

3. Page Elements

The page classes import BasePageElement from a module called element (element.py):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait


class BasePageElement(object):
    """Base page class that is initialized on every page object class."""

    def __set__(self, obj, value):
        """Sets the text to the value supplied"""
        driver = obj.driver
        WebDriverWait(driver, 100).until(
            lambda driver: driver.find_element(By.NAME, self.locator))
        driver.find_element(By.NAME, self.locator).clear()
        driver.find_element(By.NAME, self.locator).send_keys(value)

    def __get__(self, obj, owner):
        """Gets the text of the specified object"""
        driver = obj.driver
        WebDriverWait(driver, 100).until(
            lambda driver: driver.find_element(By.NAME, self.locator))
        element = driver.find_element(By.NAME, self.locator)
        return element.get_attribute("value")
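
The class above works because of Python's descriptor protocol: when a class attribute defines __get__ and __set__, reading or assigning that attribute on an instance calls those methods. This is why the assignment main_page.search_text_element = "pycon" in the test case types into the search box. The browser-free sketch below shows the mechanism on its own. LoggedField and FakePage are made-up names for this illustration.

```python
class LoggedField:
    """A descriptor: attribute access on the owner is routed through it."""

    def __init__(self, name):
        self.name = name

    def __get__(self, obj, owner):
        # Called for: page.search_text
        return obj.storage.get(self.name, "")

    def __set__(self, obj, value):
        # Called for: page.search_text = value
        obj.log.append("set %s=%r" % (self.name, value))
        obj.storage[self.name] = value


class FakePage:
    """Hypothetical stand-in for a page object; no browser is involved."""

    search_text = LoggedField("q")  # must be a class attribute

    def __init__(self):
        self.storage = {}
        self.log = []


page = FakePage()
page.search_text = "pycon"    # runs LoggedField.__set__
print(page.search_text)       # runs LoggedField.__get__
print(page.log)
```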


4. Locators

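One common practice is to keep the locator strings separate from the place where they are used. In this example, the locators of each page are grouped in one class, in a module called locators (locators.py).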

from selenium.webdriver.common.by import By

class MainPageLocators(object):
    """A class for main page locators. All main page locators should come here"""
    GO_BUTTON = (By.ID, 'submit')

class SearchResultsPageLocators(object):
    """A class for search results locators. All search results locators should come here"""
    pass

Other Tools Like Selenium:


  • Katalon Studio
  • Subject7
  • Screenster
  • TestCraft
  • Endtest
  • Browsersync
  • Protractor
  • CasperJS

Some Other Python Web Scraping Tools:


  • Beautiful Soup
  • LXML
  • MechanicalSoup
  • Python Requests
  • Scrapy
  • urllib

This is a complete Python web scraping tutorial, and I hope you have learned enough from it. I have tried to cover all the basics that I know. The article on design is still pending and will be done soon.

Comment if you have any queries or found something wrong in the article or on our website.

Thanks for Reading

Keep Learning


Author: Ankur Gajurel

I am Ankur from Nepal trying to learn different aspects of Information Technology and Physics. I like building websites and minor projects with Python.