Developing a Web Dashboard for analyzing Amazon's Laptop sales data
<!-- *A "Draft Chapter 1" of your Project (you could see this as an expanded version of your initial proposal). This should include:
+
<ul >
 +
  <li style="margin-bottom: 0px; margin-top: 10px; margin-left: 10px">
 +
      '''Try the App at:''' http://dashboard.sinfronteras.ws
 +
  </li>
  
:*A clear explanation of your project concept (i.e. what are you doing);
 
:*A clear explanation of the rationale for the project or the problem that your project aims to solve (i.e. why are you doing it);
 
:*An outline of the sort of people / organizations who might be able to make use of your finished product (i.e. who are you doing it for);
 
:*An outline of the various technologies that you will be using (i.e. what do you need to do it)
 
:*A draft PLAN showing that you have considered the steps that you need to follow, with your estimate of how long each step will take (i.e. how will you do it). This means you will need to have considered the REQUIREMENTS of your project, at least in draft format.
 
  
* Goal and Objectives
+
  <li style="margin-bottom: 0px; margin-top: -10px; margin-left: 10px">
* Problem
+
      '''Github repository:''' https://github.com/adeloaleman/AmazonLaptopsDashboard
* Solution  -->
+
  </li>
__NUMBEREDHEADINGS__
 
<!-- *'''Data Dashboard for Amazon Laptop Reviews'''
 
*'''Data Dashboard for Amazon Customer Reviews'''
 
*'''Data Dashboard for Laptop Reviews from Amazon'''
 
*'''Web Dashboard for analyzing Laptops sale data from «Amazon»'''
 
*'''Data Science Dashboard for Amazon Customer Reviews''' -->
 
  
*'''Try the App at:''' http://dashboard.sinfronteras.ws
+
  <!--
 +
  <li style="margin-bottom: 0px; margin-top: -10px; margin-left: 10px">
 +
      '''A demo video is available at''' https://www.youtube.com/watch?v=WrvEoA9DD4g
 +
  </li>
 +
  -->
 +
 
 +
  <li style="margin-bottom: 0px; margin-top: -10px; margin-left: 10px">
 +
      '''An example of the data (JSON file) we have scraped from Amazon:''' [[Media:AmazonLaptopReviews.json]]
 +
  </li>
  
  
*'''An example of the data (JSON file) we have scraped from Amazon:''' [[Media:AmazonLaptopReviews.json]]
+
  <li style="margin-bottom: 0px; margin-top: -10px; margin-left: 10px">
: Open it with Firefox if you don't have a proper GoogleChrome plugin to visualize JSON files.
+
      '''This report in a PDF file:'''
 +
  </li>
  
  
<br />
+
  <li style="margin-bottom: -10px; margin-top: -10px; margin-left: 10px">
{{#figure:|label=1}}
+
      '''A short presentation:'''
==What do we want to build?<span id="hola como"></span>==
+
  </li>
 +
</ul>
  
  
<br />
+
<br >
==Introduction==
When I started thinking about this project, the only clear point was that I wanted to work in Data Analysis using Python. I had already gained some experience in this field during my final BSc in IT degree project, in which I worked on a Supervised Machine Learning model for Fake News Detection. So, this time, I had a clearer idea of the scope of data analysis and related fields, and of the kind of project I was looking to work on. In addition to data analysis, I am also interested in Web development. Therefore, in this project, along with Data Analysis, I also wanted to give important weight to web development. In this context, I came up with the idea of developing a '''Web Dashboard for analyzing Amazon Laptop Reviews'''.
  
  
In a general sense, "a Data Dashboard is an information management tool that visually tracks, analyzes and displays key performance indicators (KPIs), metrics and key data points to monitor the health of a business or specific process" <ref name=":11" />. That is not a bad definition of the application we are building. We just need to add that our Dashboard displays information about laptop sales data from Amazon (customer reviews, in particular).


It is worth mentioning that the initial proposal was to develop a Dashboard for analyzing a wider range of Amazon products (not only laptops). However, because we had limited human resources and time to accomplish this project, we were forced to reduce the scope of the application. In this way, we tried to make sure that we would be able to deliver a functional Dashboard within the provided timeframe.


It is also important to highlight that the visual appeal of the application is considered an essential aspect of the development process. We are aware that this is currently a very important feature of a web application, so we are taking the necessary time to make sure we develop a Dashboard with visual appeal and a decent web design.
<br >
'''Some examples of the kind of Application we are developing:'''


* '''Example 1:''' Faction A - Sentiment Analysis Dashboard (<xr id="dashboardEx1" />) <ref name=":7" />

: This is a very similar application to the one we are building. Our initial requirements and prototype were based on some of the components shown in this application.


<figure id="dashboardEx1">
[[File:SentimentAnalysisDashboard-FactionA.png|center|thumb|850x850px|
<caption>Faction A - Sentiment Analysis Dashboard
<br />
This is an example of the kind of Dashboard we are building <ref name=":7" />
<br />
https://www.youtube.com/watch?v=R5HkXyAUUII</caption>
]]
</figure>


<br />

* '''Example 2:''' Shards dashboard (<xr id="dashboardEx2" />) <ref name=":20" />

: This dashboard doesn't have a similar function to the dashboard we are building. However, it is a good example of the design and visual appeal of the application we aim to build.


<figure id="dashboardEx2">
[[File:Dashboard1.png|center|thumb|850x850px|
<caption>Shards dashboard. This is an example of the kind of Dashboard we are building <ref name=":20" />
]]
</figure>
At this point, version 1.0 of the application has already been built and deployed. It is currently running for testing purposes at http://dashboard.sinfronteras.ws


You can also access the source code and download it from our Github repository at http://github.com/adeloaleman/AmazonLaptopsDashboard


The content of this report is organized in the following way:

* We start by giving some arguments that justify the business value of this kind of application. We understand that the scope of our Dashboard is limited because we only consider laptop data, but we try to explain, in a broad sense, the advantages of analyzing retail sales data in order to enhance a commercial strategy.

* In the [[Developing a Web Dashboard for analyzing Amazon's Laptop sales data#Development process|Development process]] section, we go through the different phases of the development process. The goal is to make clear the general architecture of the application, the technologies we have used, and the reasons for the decisions made. Only some portions of the code will be explained; there is no point in explaining every programming detail of hundreds of lines of code.

* We finally describe the process followed to deploy the application in the cloud. In that chapter, we explain the current deployment architecture, in which the back-end and front-end run on the same server, but we also present a three-tier architecture for high availability and strict security policies that has been designed for a future cloud deployment of the Dashboard.


<br />
==Project rationale and business value==
In marketing and business intelligence, it is crucial to know what customers think about target products. Customer opinions, reviews, or any data that reflects the experience of the customer represent important information that companies can use to enhance their marketing and business strategies.


Marketing and business intelligence professionals also need a practical way to display and summarize the most important aspects and key indicators of the dynamics of the target market. But what do we mean when we refer to the dynamics of the market? We use this term to refer to the most important information that business professionals require to understand the market, and thus be able to make decisions that seek to improve the revenue of the company.

Now, let's explain with a practical example which kind of information business professionals need to know about a target product or market. Suppose that you are a Business Intelligence Manager working for an IT company (Lenovo, for example). The company is looking to improve its laptop sales strategy. Which kind of information do you need to be able to make key decisions about the tech specs and features that the new generation of laptops should have to become a more attractive product in the market? You would need to analyze, for instance:

* Which are the best-selling laptops?
* Are Lenovo laptops in a good position in the market?
* Who are the top Lenovo competitors in the industry?
* What are the key features that customers take into consideration when buying a laptop?
* What are the key tech specs that customers like (and don't like) about Lenovo and competitor laptops?
* How much are customers willing to pay for a laptop?

Those are just some examples of the information a business intelligence professional needs to know when looking for the best strategy. Let's say that, after analyzing the data, you found that the top-selling laptops are actually the most expensive ones: laptops with high-quality tech specs and performance. You also found that Lenovo laptops are, in general, below the range of prices and quality tech specs of the top-selling laptops.


With the above information, a logical strategy could be to invest in an action plan to improve the tech specs and general quality of Lenovo laptops. If, on the contrary, the information highlights that very expensive laptops have very low demand, an intelligent approach could be a strategy to reduce the cost of the new generation of laptops.


So, we have seen the importance of analyzing relevant data to understand the dynamics of the market when looking to enhance a business strategy. Now, from where and how can we get the necessary data to perform a market analysis for a business plan?


This kind of data can be collected by asking retailers directly for information. For example, if you have access to a detailed annual Sales & Marketing report of a computer retailer, you will have the kind of information that can be valuable to understand the dynamics of the market. Such a report could contain details about the best-selling computers, prices, tech specs, revenues, etc. However, an annual sales report would be missing detailed information about what customers think of the products they bought. Traditionally, this kind of data has been collected using methods such as '''face-to-face or telephone surveys'''.


Recently, '''the huge amount of data generated every day on social networks and online retailers''' has started being used to perform analyses that allow us to gain a better understanding of the market and, in particular, of customer opinions. This method is becoming a more effective, practical, and cheaper way of gathering this kind of information compared to traditional methods.


<br >
==Development process==

<figure id="fig:architecture1">
[[File:AmazonLaptopDashboard-Architecture_diagram1.png|center|thumb|1000px|
<caption>Application architecture diagram: This is the big picture of the application's operation. Notice the different components and technologies, and the connections between them.
<br />
This is the current implementation. All the components are running on the same server. However, the goal is to implement a 3-Tier Architecture as shown in <xr id="fig:architecture2" />.</caption>
]]
</figure>


<br />

<figure id="fig:architecture2">
[[File:AmazonLaptopDashboard-Architecture_diagram2.png|center|thumb|1000px|
<caption>Three-Tier application architecture diagram. This will be the final deployment of the application.</caption>
]]
</figure>


<br >
===Back-end===


<br >
====Scraping data from Amazon====
We first need to get the data that we want to display and analyze in the Dashboard.


As we have already mentioned, we want to extract data related to laptops for sale from www.amazon.com. The goal is to collect the details of about 100 laptops of different brands and models and save this data as a JSON text file.


To better explain which information we need, in <xr id="fig:amazon_page" /> we show a laptop sale page from https://www.amazon.com/A4-53N-77AJ/dp/B07QXL8YCX. From that webpage, we need to extract the following information:


{| class="wikitable" style="width:80%; margin: 0 auto"
!style="width:33%;"|Main details
!style="width:33%;"|Tech details
!style="width:33%;"|Reviews
|- style="vertical-align:top;"
|
* '''URL:''' https://www.amazon.com/A…4-53N-77AJ/dp/B07QXL8YCX
* '''ASIN:''' B07QXL8YCX
* '''Price:''' $709.99
* '''Average customer reviews:''' 4.3 out of 5 stars
* '''Number of reviews:''' 167
* '''Number of ratings:''' 180
|
* '''Screen Size:''' 14 inches
* '''Max Screen Resolution:''' 1920 x 1080
* '''Processor:''' 4.6 GHz Core i7 Family
* '''RAM:''' 16 GB DDR4
* '''Hard Drive:''' 512 GB Flash Memory Solid State
* ...
|
* '''Review 1:'''
*:* '''User name:''' Tom Steele
*:* '''Rating:''' 4.0 out of 5 stars
*:* '''Date:''' June 13, 2019
*:* '''Review title:''' If only it had a USB-C port!
*:* '''Review:''' This is a great computer in many respects. The specs of the computer are great and it is fast and snappy. The 512 GB SSD feels fast and speedy. The 16 GB of fast RAM is nice. The i7 processor is a beast, with speed (and noisy fan) bursts when needed. This is a fast, well-optioned machine inside...

* '''Review 2:'''
::* ...

* '''Review 3:'''
::* ...
|}
The process of extracting data from websites is commonly known as web scraping <ref name=":16" />. Web scraping software can be used to automatically extract large amounts of data from a website and save the data for later processing. <ref name=":15" />


In this project, we are using Scrapy as our web scraping solution. This is one of the most popular Python web scraping frameworks. According to its official documentation, "Scrapy is a fast high-level web crawling and web scraping framework used to crawl websites and extract structured data from their pages" <ref name=":13" />


The code snippet included in <xr id="fig:code_scrapy" /> shows an example of a Scrapy Python program. This is a portion of the program we built to extract laptop data from www.amazon.com. The goal of this report is not to explain all the technical details of the code; that would make for very long documentation, so it is better to review the official Scrapy documentation for the details. <ref name=":14" />


You can access the source code from our Github repository at
https://github.com/adeloaleman/AmazonLaptopsDashboard/blob/master/AmazonScrapy/AmazonScrapy/spiders/amazonSpider.py


Web scraping can become a complex task when the information we want to extract is structured across more than one page. For example, in our project, the most important data we need is customer reviews. Because a laptop on Amazon can have countless reviews, this information can become so extensive that it cannot be displayed on a single page but on a set of similar pages.


'''Let's explain the process and some of the complexities of extracting laptop data from Amazon.com:'''


:* The following is the link to the base Amazon laptops page. This page displays the first group of 24 laptops available at Amazon.com, including all brands and features: https://www.amazon.com/Laptops-Computers-Tablets/s?rh=n%3A565108&page=1

:* If you want to keep reviewing laptops, you need to click the «Next Page» link at the bottom of the page, which will bring you to the next group of 24 laptops. However, instead of clicking the «Next Page» link, you could also enter the following URL in your web browser: https://www.amazon.com/Laptops-Computers-Tablets/s?rh=n%3A565108&page=2

:* Then, to go to the third group of 24 laptops, you can use the same address with a 3 at the end: https://www.amazon.com/Laptops-Computers-Tablets/s?rh=n%3A565108&page=3

:* So, to extract information from a large number of laptops, we have to scrape a sequence of similar pages.

:* Most of the information we want to extract for each laptop is on the laptop's main page. As we saw in <xr id="fig:amazon_page" />, we can get the "ASIN", "Price", "Average customer reviews" and other details from this main page.

:* However, the customer reviews are such long data that they cannot be displayed on only one page, but on a sequence of pages linked from the main laptop page. So, in the case of our example laptop page (<xr id="fig:amazon_page" />), customer reviews are displayed on these pages:

:::* https://www.amazon.com/Acer-Convertible-Fingerprint-Rechargeable-SP314-53N-77AJ/product-reviews/B07QXL8YCX/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=1

:::* https://www.amazon.com/Acer-Convertible-Fingerprint-Rechargeable-SP314-53N-77AJ/product-reviews/B07QXL8YCX/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=2

:::* https://www.amazon.com/Acer-Convertible-Fingerprint-Rechargeable-SP314-53N-77AJ/product-reviews/B07QXL8YCX/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=3

:::* ...
 
<br />
<figure id="fig:amazon_page">
[[File:Amazon_page01.png|950px|thumb|center|
<caption>Laptop page from https://www.amazon.com/A…4-53N-77AJ/dp/B07QXL8YCX</caption>
]]
</figure>

<br />
<figure id="fig:code_scrapy">
<syntaxhighlight lang="python3">
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "amazon_links"

    start_urls = []

    # Link to the base Amazon laptops page. This page displays the first group of laptops
    # available at Amazon.com, including all brands and features:
    myBaseUrl = 'https://www.amazon.com/Laptops-Computers-Tablets/s?rh=n%3A565108&page='
    for i in range(1, 3):
        start_urls.append(myBaseUrl + str(i))

    def parse(self, response):
        # Collect all product links on the listing page and keep only known laptop brands
        data = response.css("a.a-text-normal::attr(href)").getall()
        links = [s for s in data if "Lenovo-"  in s
                                 or "LENOVO-"  in s
                                 or "Hp-"      in s
                                 or "HP-"      in s
                                 or "Acer-"    in s
                                 or "ACER-"    in s
                                 or "Dell-"    in s
                                 or "DELL-"    in s
                                 or "Samsung-" in s
                                 or "SAMSUNG-" in s
                                 or "Asus-"    in s
                                 or "ASUS-"    in s
                                 or "Toshiba-" in s
                                 or "TOSHIBA-" in s
                ]

        # Remove duplicates and anchor links that point to sections of the same page
        links = list(dict.fromkeys(links))
        links = [s for s in links if "#customerReviews"  not in s]
        links = [s for s in links if "#productPromotions" not in s]

        for i in range(len(links)):
            links[i] = response.urljoin(links[i])
            yield response.follow(links[i], self.parse_compDetails)

    def parse_compDetails(self, response):
        def extract_with_css(query):
            return response.css(query).get(default='').strip()

        price = response.css("#priceblock_ourprice::text").get()

        # Main product details table (ASIN, average customer reviews, ...)
        product_details_table  = response.css("#productDetails_detailBullets_sections1")
        product_details_values = product_details_table.css("td.a-size-base::text").getall()
        product_details_values = [v.strip() for v in product_details_values]

        ASIN = product_details_values[0]
        average_customer_reviews = product_details_values[4]

        # Number of reviews and number of ratings
        number_reviews_div = response.css("#reviews-medley-footer")
        number_reviews_ratings_str  = number_reviews_div.css("div.a-box-inner::text").get()
        number_reviews_ratings_str  = number_reviews_ratings_str.replace(',', '')
        number_reviews_ratings_str  = number_reviews_ratings_str.replace('.', '')
        number_reviews_ratings_list = [int(s) for s in number_reviews_ratings_str.split() if s.isdigit()]
        number_reviews = number_reviews_ratings_list[0]
        number_ratings = number_reviews_ratings_list[1]

        reviews_link = number_reviews_div.css("a.a-text-bold::attr(href)").get()
        reviews_link = response.urljoin(reviews_link)

        # First tech-details table (screen size, resolution, processor, ...)
        tech_details1_table  = response.css("#productDetails_techSpec_section_1")
        tech_details1_keys   = tech_details1_table.css("th.prodDetSectionEntry")
        tech_details1_values = tech_details1_table.css("td.a-size-base")

        tech_details1 = {}
        for i in range(len(tech_details1_keys)):
            text_keys   = tech_details1_keys[i].css("::text").get().strip()
            text_values = tech_details1_values[i].css("::text").get().strip()
            tech_details1[text_keys] = text_values

        # Second tech-details table
        tech_details2_table  = response.css("#productDetails_techSpec_section_2")
        tech_details2_keys   = tech_details2_table.css("th.prodDetSectionEntry")
        tech_details2_values = tech_details2_table.css("td.a-size-base")

        tech_details2 = {}
        for i in range(len(tech_details2_keys)):
            text_keys   = tech_details2_keys[i].css("::text").get().strip()
            text_values = tech_details2_values[i].css("::text").get().strip()
            tech_details2[text_keys] = text_values

        tech_details = {**tech_details1, **tech_details2}

        # Follow the reviews link, carrying the details collected so far in the request meta
        reviews = []
        yield response.follow(reviews_link,
                              self.parse_reviews,
                              meta={
                                'url': response.request.url,
                                'ASIN': ASIN,
                                'price': price,
                                'average_customer_reviews': average_customer_reviews,
                                'number_reviews': number_reviews,
                                'number_ratings': number_ratings,
                                'tech_details': tech_details,
                                'reviews_link': reviews_link,
                                'reviews': reviews,
                              })
</syntaxhighlight>
<caption>Portion of the Python Scrapy program we built to extract laptop data from www.amazon.com</caption>
</figure>
<br />
<figure id="fig:amazon_data">
[[File:AmazonLaptopDashboard-Data json.png|700px|thumb|center|
<caption>Data scraped from Amazon using Scrapy. Download or open the file from [[Media:AmazonLaptopReviews.json]]</caption>
]]
</figure>


<br >
====Data Analytics====
* '''Loading the data'''

* '''Data Preparation and Text pre-processing'''
::* Creating new columns to facilitate handling of customer reviews and tech details
::* Modifying the format and data type of some columns
::* Removing punctuation and stopwords

* '''Data visualization'''
::* Avg. customer reviews & Avg. price bar charts
::* Avg. customer reviews vs. Price bubble chart
::* Customer reviews word cloud
::* Word count bar chart of customer reviews

* '''Sentiment analysis'''


<br >
=====Loading the data=====
After extracting the data from Amazon using Scrapy, we stored it in a simple JSON text file (<xr id="fig:amazon_data" />). Here we import the data from the JSON text file into a pandas dataframe:
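
A minimal sketch of this step (assuming the JSON file is the one linked above, AmazonLaptopReviews.json; the full code is in the appendices):

<syntaxhighlight lang="python3">
import pandas as pd

# Each record of the JSON file is one laptop scraped from Amazon
df = pd.read_json('AmazonLaptopReviews.json')
</syntaxhighlight>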
  
<br >
=====Data Preparation and Text pre-processing=====
In the appendices, we have included the Python code used for Data Preparation and Text pre-processing. Below, we explain step by step the process followed to prepare the data.


<blockquote>
'''Creating new columns to facilitate handling of customer reviews and tech details:'''
<blockquote>
After loading the data from the JSON file, every "review" entry is a dictionary-type value composed of several fields: customer name, rating, date, title, and the text of the review itself.


Here we extract the relevant details (the title and the text of the review itself) and create 3 new columns to facilitate the handling of the "review" entries. We create the following columns: "reviews_title", "reviews_text" and "reviews_one_string":
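
A minimal sketch of this step (we assume here that each element of the "reviews" column is a list of dictionaries with "review_title" and "review" keys; the exact key names are those used in the appendix code):

<syntaxhighlight lang="python3">
# Pull the title and the body text out of each review dictionary
df['reviews_title'] = df['reviews'].apply(lambda revs: [r['review_title'] for r in revs])
df['reviews_text']  = df['reviews'].apply(lambda revs: [r['review'] for r in revs])

# Concatenate all the review texts of a laptop into one string for text analysis
df['reviews_one_string'] = df['reviews_text'].apply(' '.join)
</syntaxhighlight>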
  
  
  
After loading the data from the JSON file, all technical details are in a dictionary-type entry. In the following block, we extract the tech details that are important for our analysis ("series" and "model_number") and create a new column for each of these relevant tech details:
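
A minimal sketch (assuming the tech-details dictionary uses the key names shown on the Amazon tech-details table; the exact keys are those used in the appendix code):

<syntaxhighlight lang="python3">
# Promote the relevant tech details to their own columns
df['series']       = df['tech_details'].apply(lambda d: d.get('Series'))
df['model_number'] = df['tech_details'].apply(lambda d: d.get('Item model number'))
</syntaxhighlight>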
  
  
:::*We think the Rivalry context is <span style="color:blue">POSITIVE</span> for our project.
+
 
 +
</blockquote>
 +
</blockquote>
 +
 
 +
 
 +
<blockquote>
 +
'''Modifying the format and data type of some columns:'''
 +
<blockquote>
 +
 
 +
Here we make sure that the first character of the brand name is uppercase and remaining characters lowercase. This is important because we are going to perform filtering and searching function using the brand name so we need to make sure the writing is consistent:
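
A minimal sketch of this normalization (assuming the column is named "brand"; the exact column name is in the appendix code):

<syntaxhighlight lang="python3">
# "LENOVO", "lenovo" and "Lenovo" must all become "Lenovo"
df['brand'] = df['brand'].str.capitalize()
</syntaxhighlight>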


After extracting the data from the web page, the numeric values ("average_customer_reviews" and "price") are actually of string type, so we need to convert these entries to a numeric type (float). This is necessary because we will perform mathematical operations with these values.

The following function takes a numeric string (<class 'str'>), removes any comma or dollar characters ("," and "$"), and returns a numeric float value (<class 'float'>):
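
A minimal sketch of such a function (the name "format_cleaner" is the one referenced below):

<syntaxhighlight lang="python3">
def format_cleaner(value):
    """Turn a numeric string such as "$1,099.99" into the float 1099.99."""
    return float(value.replace(',', '').replace('$', ''))
</syntaxhighlight>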


A raw "average_customer_reviews" entry looks like this: "4.5 out of 5 stars" (<class 'str'>).

We only need the first value, as a numeric float type: 4.5 (<class 'float'>).

This is done in the next line of code, over the entire dataframe, by selecting only the first element ("4.5" in the above example) and applying the "format_cleaner()" function to the "average_customer_reviews" column:
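
A sketch of this line (splitting the string on whitespace and cleaning the first token):

<syntaxhighlight lang="python3">
df['average_customer_reviews'] = df['average_customer_reviews'].apply(
    lambda s: format_cleaner(s.split()[0]))
</syntaxhighlight>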


A raw "price" entry looks like this: "$689.90" (<class 'str'>). We only need the numeric value: 689.90 (<class 'float'>). This is done in the next line of code, over the entire dataframe, by applying the "format_cleaner()" function to the "price" column:


</blockquote>
</blockquote>


<blockquote>
'''Removing punctuation and stopwords:'''
<blockquote>
* '''Punctuation:''' We remove all punctuation characters found in the ''string'' Python library.

* '''Our stopwords are composed of:'''

::* The common stopwords defined in the ''nltk'' library

::* Some particular stopwords related to our data:

::::* Brand names: There is no point in analyzing brand names. For instance, in a Lenovo review, the customer will use the word "Lenovo" many times, but this fact does not contribute anything to the analysis.

::::* Laptop synonyms: laptop, computer, machine, etc.

::::* Some non-official contractions that are not in the ''nltk'' library: Im, dont, Ive, etc.
  
  
 
Defining our stopwords list:
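
A minimal sketch (the brand and synonym lists shown here are shortened; the full lists are in the appendix code):

<syntaxhighlight lang="python3">
from nltk.corpus import stopwords   # requires a one-time nltk.download('stopwords')

our_stopwords = set(stopwords.words('english'))
our_stopwords.update(['lenovo', 'hp', 'acer', 'dell', 'samsung', 'asus', 'toshiba'])  # brand names
our_stopwords.update(['laptop', 'computer', 'machine'])                               # laptop synonyms
our_stopwords.update(['im', 'dont', 'ive'])                                           # non-official contractions
</syntaxhighlight>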


The following function takes a string and returns the same string without punctuation or stopwords:
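
A sketch of such a function (named "pre_processing", as referenced below):

<syntaxhighlight lang="python3">
import string

def pre_processing(text):
    """Remove punctuation and stopwords from a string."""
    text = ''.join(ch for ch in text if ch not in string.punctuation)
    return ' '.join(w for w in text.split() if w.lower() not in our_stopwords)
</syntaxhighlight>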


Example of applying the "pre_processing()" function:
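
<syntaxhighlight lang="python3">
# With the sketch above, brand names, stopwords and punctuation are dropped:
pre_processing("I love this Lenovo laptop, don't regret it!")
# -> 'love regret'
</syntaxhighlight>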


Here we apply the "pre_processing()" function to the "reviews_one_string" column over the entire dataframe:
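
<syntaxhighlight lang="python3">
df['reviews_one_string'] = df['reviews_one_string'].apply(pre_processing)
</syntaxhighlight>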


'''At this point, the data is ready for visualization at the front-end.'''
</blockquote>
</blockquote>
 
  
<br >
=====Data visualization=====
As we mentioned, this version 1.0 hasn't been clearly separated into a 3-tier architecture yet. At this point, everything is running on one server (<xr id="fig:architecture1" />). However, the analysis has always been designed with the goal of deploying the application in a 3-tier architecture (<xr id="fig:architecture2" />).

Regarding chart building for data visualization, it has been particularly challenging to define the tier of the application architecture (back-end or front-end) in which chart building must be placed.

Although it is clear that the charts themselves are part of the front-end, there is usually a data processing step, linked with graphing, that could be placed at the back-end.

This is why we prefer to explain the programming details related to data visualization in this section. However, other data visualization aspects related to the same charts will be treated in the Front-end section.


<br >
======Avg customer reviews vs Avg price bar charts======
We will only talk about the code used to generate the charts in this section. We will explain why we have included these charts in the application and discuss other analytical aspects in the Front-end section.
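
A minimal sketch of these charts (using Plotly graph objects on the columns prepared in the previous section; the grouping shown here is illustrative):

<syntaxhighlight lang="python3">
import plotly.graph_objects as go

# Average customer reviews by brand; the average-price chart is built the same way
by_brand = df.groupby('brand')[['average_customer_reviews', 'price']].mean()

fig = go.Figure(go.Bar(x=by_brand.index, y=by_brand['average_customer_reviews']))
fig.update_layout(title='Avg. customer reviews by brand')
</syntaxhighlight>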
  
  

<br >
======Avg customer reviews vs Price bubble chart======
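
A minimal sketch (using Plotly Express, with the bubble size given by the number of reviews):

<syntaxhighlight lang="python3">
import plotly.express as px

fig = px.scatter(df, x='price', y='average_customer_reviews',
                 size='number_reviews', color='brand',
                 title='Avg. customer reviews vs. price')
</syntaxhighlight>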


<br >
======Customer reviews word cloud======
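
A minimal sketch (assuming the third-party ''wordcloud'' package, a common choice for this kind of chart):

<syntaxhighlight lang="python3">
from wordcloud import WordCloud

# One big text built from all the pre-processed reviews
all_reviews = ' '.join(df['reviews_one_string'])
wc = WordCloud(width=800, height=400, background_color='white').generate(all_reviews)
wc.to_file('reviews_wordcloud.png')
</syntaxhighlight>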


<br >
======Word count bar chart of customer reviews======
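
A minimal sketch (counting the most frequent words in the pre-processed reviews):

<syntaxhighlight lang="python3">
from collections import Counter
import plotly.graph_objects as go

words = ' '.join(df['reviews_one_string']).split()
top = Counter(words).most_common(20)

fig = go.Figure(go.Bar(x=[w for w, c in top], y=[c for w, c in top]))
fig.update_layout(title='Most frequent words in customer reviews')
</syntaxhighlight>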


<br />
=====Sentiment analysis=====
We regret not having completed this part in time for it to be included in this version. Although we have already started some tests, we don't yet have enough material to report progress here. We want to make it clear that we are currently working on this point.


<br >
===Front-end===
To develop the front-end, we have used a Python framework called Dash. This is a relatively new framework; it is open source and "it is built on top of ''Flask'', ''Plotly.js'', and ''React''. It enables you to build dashboards using pure Python". <ref name=":17" />


According to its official documentation, "Dash is ideal for building data visualization apps. It's particularly suited for anyone who works with data in Python" <ref name=":18" />


Dash is designed to integrate plots built with the graphing library Plotly. This is an open-source library that allows us to make interactive, publication-quality graphs. So all the charts we integrate into our web application are built with Plotly <ref name=":26" />


We have essentially followed the official documentation of both technologies, Dash <ref name=":18" /> and Plotly <ref name=":26" />, to gain the technical skills necessary to carry out this project.
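
As a minimal illustration of how a Dash app is structured (this is a generic sketch, not our actual layout; the placeholder figure stands for any Plotly chart such as the ones built in the Data visualization section):

<syntaxhighlight lang="python3">
import dash
import dash_core_components as dcc
import dash_html_components as html
import plotly.graph_objects as go

app = dash.Dash(__name__)
server = app.server   # the underlying Flask instance, used later for deployment

fig = go.Figure(go.Bar(x=['A', 'B'], y=[1, 2]))   # placeholder figure

app.layout = html.Div([
    html.H1('Amazon Laptops Dashboard'),
    dcc.Graph(id='example-chart', figure=fig),
])

if __name__ == '__main__':
    app.run_server(debug=True)
</syntaxhighlight>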
  
We think that there is no point in explaining all the code we have written in this part. The code is long, and we consider it unnecessary to spend so many pages on a purely technical explanation of it. We think it is more important to explain the rationale behind the decisions we made when defining the design and the elements used in the development process.


You can also access the source code and download it from our Github repository at http://github.com/adeloaleman/AmazonLaptopsDashboard


In the following sections, we describe the main features and elements of the front-end development process.


<br >
====Some main features about the front-end design====


<br >
====Home page====


<br >
=====Brand selection, Series selection and Price panels=====


<br >
=====Avg. customer reviews vs. Prices panel=====


<br >
=====Customer reviews Word cloud and Word count bar chart=====


<br >
====Modules under construction====


<br >
==Cloud Deployment==
  

<figure id="fig:cloud1">
[[File:AmazonLaptopsDashboard-cloud architecture1.png|center|1000px]]
<center>
<caption>Cloud design for high availability and security. This is the latest deployment of the Dashboard.</caption>
</center>
</figure>


<figure id="fig:cloud2">
[[File:AmazonLaptopsDashboard-cloud architecture2.png|center|1000px]]
<center>
<caption>Cloud design of a three-tier architecture for high availability.</caption>
</center>
</figure>


<br >
==Conclusion==
  
  
 
<br />
==References==
{{Reflist|2}}

<div style="display: none">
<ref name=":1">{{Cite web |last=Ellingwood |first=Justin |date=May 2016 |title=How To Serve Flask Applications with Gunicorn and Nginx on Ubuntu 16.04 |website=Digitalocean |url=https://www.digitalocean.com/community/tutorials/how-to-serve-flask-applications-with-gunicorn-and-nginx-on-ubuntu-16-04 |url-status=live}}</ref>
<ref name=":2">{{Cite web |date=2018 |title=Failed to find application object 'server' in 'app' |website=Plotly community |url=https://flask.palletsprojects.com/en/1.1.x/deploying/wsgi-standalone/#gunicorn |url-status=live}}</ref>
<ref name=":3">{{Cite web |last=Chesneau |first=Benoit |title=Deploying Gunicorn |website=Gunicorn documentation |url=https://docs.gunicorn.org/en/latest/deploy.html |url-status=live}}</ref>
<ref name=":4">{{Cite web |title=Standalone WSGI Containers |website=Pallets documentation |url=https://flask.palletsprojects.com/en/1.1.x/deploying/wsgi-standalone/#gunicorn |url-status=live}}</ref>
<ref name=":5">{{Cite web |title=Deploying Dash Apps |website=Dash documentation |url=https://dash.plotly.com/deployment |url-status=live}}</ref>
<ref name=":6">{{Cite web |title=Dropdown Examples and Reference |website=Dash documentation |url=https://dash.plotly.com/dash-core-components/dropdown |url-status=live}}</ref>
<ref name=":7">{{Cite web |title=Faction A - Sentiment Analysis Dashboard |website=Microsoft |url=https://powerbi.microsoft.com/en-us/partner-showcase/faction-a-sentiment-analysis-dashboard/ |url-status=live}}</ref>
<ref name=":8">{{Cite web |last=Ka Hou |first=Sio |date=2019 |title=E-commerce Reviews Analysis |website=towardsdatascience.com |url=https://towardsdatascience.com/e-commerce-reviews-analysis-902210726d47 |url-status=live}}</ref>
<ref name=":9">{{Cite web |date=2019 |title=4 Áreas con importantes datos por analizar |website=UCOM - Universidad comunera |url=http://www.ucom.edu.py/4-areas-con-importantes-datos-por-analizar |url-status=live}}</ref>
<ref name=":10">{{Cite web |title=Bootstrap Navigation Bar |website=W3schools documentation |url=https://www.w3schools.com/bootstrap/bootstrap_navbar.asp |url-status=live}}</ref>
<ref name=":11">{{Cite web |title=What is a data dashboard? |website=Klipfolio |url=https://www.klipfolio.com/resources/articles/what-is-data-dashboard |url-status=live}}</ref>
<ref name=":12">{{Cite web |last=Britton |first=Carol |author2=Doake, Jill |date=2005 |title=A Student Guide to Object-Oriented Development |url=https://www.sciencedirect.com/book/9780750661232/a-student-guide-to-object-oriented-development |url-status=live}}</ref>
<ref name=":13">{{Cite web |title=Official Scrapy website |website=Scrapy |url=https://scrapy.org/ |url-status=live}}</ref>
<ref name=":14">{{Cite web |title=Scrapy 2.1 documentation |website=Scrapy documentation |url=https://docs.scrapy.org/en/latest/ |url-status=live}}</ref>
<ref name=":15">{{Cite web |title=Scrapy |website=Wikipedia |url=https://en.wikipedia.org/wiki/Scrapy |url-status=live}}</ref>
<ref name=":16">{{Cite web |title=Web scraping |website=Wikipedia |url=https://en.wikipedia.org/wiki/Web_scraping |url-status=live}}</ref>
<ref name=":17">{{Cite web |date=2019 |title=Dash for Beginners |website=Datacamp |url=https://www.datacamp.com/community/tutorials/learn-build-dash-python |url-status=live}}</ref>
<ref name=":18">{{Cite web |title=Introduction to Dash |website=Dash documentation |url=https://dash.plotly.com/introduction |url-status=live}}</ref>
<ref name=":19">{{Cite web |title=Create Charts & Diagrams Online |website=Lucidchart |url=https://www.lucidchart.com |url-status=live}}</ref>
<ref name=":20">{{Cite web |title=Shards dashboard - Demo |website=Designrevision |url=https://designrevision.com/demo/shards-dashboard-lite-react/blog-overview |url-status=live}}</ref>
<ref name=":21">{{Cite web |title=Upcube dashboard - Demo |website=themesdesign |url=https://themesdesign.in/upcube/layouts/horizontal/index.html |url-status=live}}</ref>
<ref name=":22">{{Cite web |title=W3.CSS Sidebar |website=W3schools documentation |url=https://www.w3schools.com/w3css/w3css_sidebar.asp |url-status=live}}</ref>
<ref name=":23">{{Cite web |title=Sidebar code example |website=W3schools documentation |url=https://www.w3schools.com/w3css/tryit.asp?filename=tryw3css_sidebar_shift |url-status=live}}</ref>
<ref name=":24">{{Cite web |title=Navbar |website=Bootstrap documentation |url=https://dash-bootstrap-components.opensource.faculty.ai/docs/components/navbar/# |url-status=live}}</ref>
<ref name=":25">{{Cite web |title=Bubble Charts in Python |website=Plotly documentation |url=https://plotly.com/python/bubble-charts/ |url-status=live}}</ref>
<ref name=":26">{{Cite web |title=Plotly Python Open Source Graphing Library |website=Plotly documentation |url=https://plotly.com/python/ |url-status=live}}</ref>
<ref name=":27">{{Cite web |title=Bar Charts in Python
 +
|website=Plotly documentation
 +
|url=https://plotly.com/python/bar-charts/
 +
|url-status=live
 +
|last=
 +
|first=
 +
|date=
 +
|access-date=
 +
}}
 +
</ref>
 +
<!--
  
==References==
 
<!-- {{Reflist|2}} -->
 
  
<references />
+
--> 
 +
<!--
 +
<ref name=":6">
 +
{{Cite web
 +
|title=
 +
|website=
 +
|url=
 +
|url-status=live
 +
|last=
 +
|first=
 +
|date=
 +
|access-date=
 +
}}
 +
</ref>
 +
--> 
 +
<!--
  
  
<br />
+
--> 
 +
<!-- <ref name=":0" /> -->
 +
</div>

Latest revision as of 23:03, 25 January 2021



Introduction

When I started thinking about this project, the only clear point was that I wanted to work in Data Analysis using Python. I had already got some experience in this field in my final BSc in IT Degree project, working on a Supervised Machine Learning model for Fake News Detection. So, this time, I had some clearer ideas of the scope of data analysis and related fields and about the kind of project I was looking to work on. In addition to data analysis, I'm also interested in Web development. Therefore, in this project, along with Data Analysis, I was also looking to give important weight to web development. In this context, I got with the idea of developing a Web Dashboard for analyzing Amazon Laptop Reviews.


In a general sense, "a Data Dashboard is an information management tool that visually tracks, analyzes and displays key performance indicators (KPI), metrics and key data points to monitor the health of a business or specific process" [1]. That is a good description of the application we are building; we only need to add that our Dashboard displays information about laptop sales data from Amazon (customer reviews, in particular).


We should mention that the initial proposal was to develop a Dashboard for analyzing a wider range of Amazon products (not only laptops). However, because we had limited personnel and time for this project, we were forced to reduce the scope of the application. This way, we made sure we would be able to deliver a functional Dashboard within the timeframe provided.


It is also important to highlight that the visual appeal of the application has been considered an essential aspect of the development process. We are aware that this is currently a very important feature of a web application, so we have taken the necessary time to make sure we develop a Dashboard with visual appeal and a decent web design.



Some examples of the kind of Application we are developing:

  • Example 1: Faction A - Sentiment Analysis Dashboard (Figure 1) [2]
This is a very similar application to the one we are building. Our initial requirements and prototype were based on some of the components shown in this application.


Figure 1: Faction A - Sentiment Analysis Dashboard
This is an example of the kind of Dashboard we are building [2]
https://www.youtube.com/watch?v=R5HkXyAUUII



  • Example 2: Shards dashboard (Figure 2) [3]
This dashboard does not serve the same purpose as the one we are building. However, it is a good example of the design and visual appeal we aim for.


Figure 2: Shards dashboard. This is an example of the kind of Dashboard we are building [3]


At this point, a version 1.0 of the application has already been built and deployed. It is currently running for testing purposes at http://dashboard.sinfronteras.ws


You can also access the source code and download it from our Github repository at http://github.com/adeloaleman/AmazonLaptopsDashboard


The content of this report is organized in the following way:

  • We start by giving some arguments that justify the business value of this kind of application. We understand that the scope of our Dashboard is limited because we only consider laptop data, but we try to explain, in a broad sense, the advantages of analyzing retail sales data to enhance a commercial strategy.


  • In the Development process section, we go through the different phases of the development process. The goal is to make clear the general architecture of the application, the technologies we have used, and the reasons for the decisions made. Only some portions of the code will be explained; there is no point in detailing every one of hundreds of lines of code.


  • We finally describe the process followed to deploy the application in the cloud. In this chapter, we explain the current deployment architecture, in which the back-end and front-end run on the same server, but we also present a three-tier architecture, designed for a future cloud deployment of the Dashboard, that targets high availability and strict security policies.



Project rationale and business value

In marketing and business intelligence, it is crucial to know what customers think about target products. Customer opinions, reviews, or any data that reflect the experience of the customer represent important information that companies can use to enhance their marketing and business strategies.

Marketing and business intelligence professionals also need a practical way to display and summarize the most important aspects and key indicators of the dynamics of the target market. But what do we mean when we refer to the dynamics of the market? We use this term to refer to the most important information that business professionals require to understand the market and thus be able to make decisions that seek to improve the revenue of the company.

Now, let's explain with a practical example what kind of information business professionals need to know about a target product or market. Suppose that you are a Business Intelligence Manager working for an IT company (Lenovo, for example). The company is looking to improve its laptop sales strategy. What kind of information would you need in order to make key decisions about the tech specs and features that the new generation of laptops should have to become a more attractive product in the market? You would need to analyze, for instance:

  • Which are the best-selling laptops?
  • Are Lenovo laptops well positioned in the market?
  • Who are the top Lenovo competitors in the industry?
  • What are the key features that customers take into consideration when buying a laptop?
  • What are the key tech specs that customers like (and don't like) about Lenovo and competitor laptops?
  • How much are customers willing to pay for a laptop?

Those are just some examples of the information a business intelligence professional needs when looking for the best strategy. Let's say that, after analyzing the data, you found that the top-selling laptops are actually the most expensive ones: laptops with high-quality tech specs and performance. You also found that Lenovo laptops are, in general, below the price range and tech-spec quality of the top-selling laptops.


With the above information, a logical strategy could be to invest in an action plan to improve the tech specs and general quality of Lenovo laptops. If, on the contrary, the information highlights that very expensive laptops have very low demand, an intelligent approach could be a strategy to reduce the cost of the new generation of laptops.


So, we have seen the importance of analyzing relevant data to understand the dynamics of the market when looking to enhance a business strategy. Now, where and how can we get the data needed to perform a market analysis for a business plan?


This kind of data can be collected by requesting information directly from retailers. For example, if you have access to a detailed annual Sales & Marketing report of a computer retailer, you will have the kind of information that can be valuable to understand the dynamics of the market. Such a report could contain details about the best-selling computers, prices, tech specs, revenues, etc. However, an annual sales report would be missing detailed information about what customers think of the products they bought. Traditionally, this kind of data has been collected using methods such as face-to-face or telephone surveys.


Recently, the huge amount of data generated every day on social networks and online retail sites has been used to perform analyses that give us a better understanding of the market and, in particular, of customer opinions. This method is becoming a more effective, practical, and cheaper way of gathering this kind of information compared to traditional methods.



Development process

Figure 3: Application architecture diagram. This is the big picture of how the application operates. Notice the different components and technologies, and the connections between them.
This is the current implementation: all the components run on the same server. However, the goal is to implement a three-tier architecture as shown in Figure 4.


Figure 4: Three-Tier application architecture diagram. This will be the final deployment of the application.



Back-end


Scraping data from Amazon

We first need to get the data that we want to display and analyze in the Dashboard.


As we have already mentioned, we want to extract data related to laptops for sale from www.amazon.com. The goal is to collect the details of about 100 laptops of different brands and models and save this data as a JSON text file.


To better explain which information we need, Figure 5 shows a laptop sale page from https://www.amazon.com/A4-53N-77AJ/dp/B07QXL8YCX. From that webpage, we need to extract the following information:


Main details and tech details:
  • Screen Size: 14 inches
  • Max Screen Resolution: 1920 x 1080
  • Processor: 4.6 GHz Core i7 Family
  • RAM: 16 GB DDR4
  • Hard Drive: 512 GB Flash Memory Solid State
  • ...

Reviews:
  • Review 1:
    • User name: Tom Steele
    • Rating: 4.0 out of 5 stars
    • Date: June 13, 2019
    • Review title: If only it had a USB-C port!
    • Review: This is a great computer in many respects. The specs of the computer are great and it is fast and snappy. The 512 GB SSD feels fast and speedy. The 16Gb of fast ram is nice. The i7 processor is a beast, with speed (and noisy fan) bursts when needed. This is a fast, well-optioned machine inside...
  • Review 2: ...
  • Review 3: ...


The process of extracting data from websites is commonly known as web scraping [4]. Web scraping software can be used to automatically extract large amounts of data from a website and save it for later processing. [5]


In this project, we are using Scrapy as our web scraping solution. It is one of the most popular Python web scraping frameworks. According to its official documentation, "Scrapy is a fast high-level web crawling and web scraping framework used to crawl websites and extract structured data from their pages" [6]


The code snippet included in Figure 6 shows an example of a Scrapy Python program. It is a portion of the program we built to extract laptop data from www.amazon.com. The goal of this report is not to explain every technical detail of the code; that would require very long documentation, so it is better to review the official Scrapy documentation for more details. [7]


You can access the source code from our Github repository at https://github.com/adeloaleman/AmazonLaptopsDashboard/blob/master/AmazonScrapy/AmazonScrapy/spiders/amazonSpider.py


Web scraping can become a complex task when the information we want to extract is structured across more than one page. For example, in our project, the most important data we need is customer reviews. Because a laptop on Amazon can have countless reviews, this information can become so extensive that it cannot be displayed on a single page but only across a set of similar pages.


Let's explain the process and some of the complexities of extracting Laptop data from Amazon.com:





  • So, to extract information from a large number of Laptops, we have to scrape a sequence of similar pages.


  • Most of the information we want to extract from each Laptop will be on the Laptop main page. As we saw in Figure 5, we can get the "ASIN", "Price", "Average customer reviews" and other details from this main page.


  • However, customer reviews can be too long to display on a single page; instead, they appear on a sequence of pages linked from the main Laptop page. So, in the case of our example Laptop page (Figure 5), customer reviews are displayed in these pages:




  • ...





import scrapy

class QuotesSpider(scrapy.Spider):
     name = "amazon_links"

     start_urls=[]

     # Link to the base Amazon laptops page. This page displays the first group of laptops
     # available at Amazon.com, including all brands and features:
     myBaseUrl = 'https://www.amazon.com/Laptops-Computers-Tablets/s?rh=n%3A565108&page='
     for i in range(1,3):
          start_urls.append(myBaseUrl+str(i))


     def parse(self, response):
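          # Collect the product links of known laptop brands from this listing
          # page, then follow each link to extract that laptop's details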
          data = response.css("a.a-text-normal::attr(href)").getall()
          links =   [s for s in data if "Lenovo-"  in  s
                                     or "LENOVO-"  in  s
                                     or "Hp-"      in  s
                                     or "HP-"      in  s
                                     or "Acer-"    in  s
                                     or "ACER-"    in  s
                                     or "Dell-"    in  s
                                     or "DELL-"    in  s
                                     or "Samsung-" in  s
                                     or "SAMSUNG-" in  s
                                     or "Asus-"    in  s
                                     or "ASUS-"    in  s
                                     or "Toshiba-" in  s
                                     or "TOSHIBA-" in  s
                    ]

          links = list(dict.fromkeys(links))
          links = [s for s in links if "#customerReviews"   not in s]
          links = [s for s in links if "#productPromotions" not in s]

          for  i in range(len(links)):
               links[i] = response.urljoin(links[i])
               yield response.follow(links[i], self.parse_compDetails)


     def parse_compDetails(self, response):
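          # From a laptop's main page, extract the price, ASIN, average rating,
          # review counts and tech specs, then follow the link to the pages that
          # list the full customer reviews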
          def extract_with_css(query):
               return response.css(query).get(default='').strip()

          price = response.css("#priceblock_ourprice::text").get()


          product_details_table  = response.css("#productDetails_detailBullets_sections1")
          product_details_values = product_details_table.css("td.a-size-base::text").getall()
          k = []
          for i in product_details_values:
               i = i.strip()
               k.append(i)
          product_details_values = k

          ASIN = product_details_values[0]

          average_customer_reviews = product_details_values[4]

          number_reviews_div = response.css("#reviews-medley-footer")
          number_reviews_ratings_str  = number_reviews_div.css("div.a-box-inner::text").get()
          number_reviews_ratings_str  = number_reviews_ratings_str.replace(',', '')
          number_reviews_ratings_str  = number_reviews_ratings_str.replace('.', '')
          number_reviews_ratings_list = [int(s) for s in number_reviews_ratings_str.split() if s.isdigit()]
          number_reviews = number_reviews_ratings_list[0]
          number_ratings = number_reviews_ratings_list[1]


          reviews_link = number_reviews_div.css("a.a-text-bold::attr(href)").get() 
          reviews_link = response.urljoin(reviews_link)


          tech_details1_table  = response.css("#productDetails_techSpec_section_1")
          tech_details1_keys   = tech_details1_table.css("th.prodDetSectionEntry")
          tech_details1_values = tech_details1_table.css("td.a-size-base")

          tech_details1 = {}
          for i in range(len(tech_details1_keys)):
               text_keys   = tech_details1_keys[i].css("::text").get()
               text_values = tech_details1_values[i].css("::text").get()

               text_keys   = text_keys.strip()
               text_values = text_values.strip()

               tech_details1[text_keys] = text_values


          tech_details2_table  = response.css("#productDetails_techSpec_section_2")
          tech_details2_keys   = tech_details2_table.css("th.prodDetSectionEntry")
          tech_details2_values = tech_details2_table.css("td.a-size-base")

          tech_details2 = {}
          for i in range(len(tech_details2_keys)):
               text_keys   = tech_details2_keys[i].css("::text").get()
               text_values = tech_details2_values[i].css("::text").get()

               text_keys   = text_keys.strip()
               text_values = text_values.strip()

               tech_details2[text_keys] = text_values

          tech_details = {**tech_details1 , **tech_details2}          


          reviews = []
          yield response.follow(reviews_link,
                                self.parse_reviews,
                                meta={
                                   'url': response.request.url,
                                   'ASIN': ASIN,
                                   'price': price,
                                   'average_customer_reviews': average_customer_reviews,
                                   'number_reviews': number_reviews,
                                   'number_ratings': number_ratings,
                                   'tech_details': tech_details,
                                   'reviews_link': reviews_link,
                                   'reviews': reviews,
                                })
Figure 6: Portion of the Python Scrapy program we built to extract Laptop data from www.amazon.com



Figure 7: Data Scraped from Amazon using Scrapy. Download or open the File from Media:AmazonLaptopReviews.json



Data Analytics


  • Loading the data
  • Data Preparation and Text pre-processing
    • Creating new columns to facilitate handling of customer reviews and tech details
    • Modifying the format and data type of some columns
    • Removing punctuation and stopwords
  • Data visualization
    • Avg. customer reviews & Avg. price bar charts
    • Avg. customer reviews vs. Price bubble chart
    • Customer reviews word cloud
    • Word count bar chart of customer reviews
    • Sentiment analysis



Loading the data

After extracting the data from Amazon using Scrapy, we stored it in a simple JSON text file (Figure 7). Here we import the data from the JSON text file into a pandas dataframe:
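
A minimal sketch of this step (assuming the scraped file is named AmazonLaptopReviews.json, like the sample data linked at the top of this page):

import pandas as pd

# Each row of the resulting dataframe corresponds to one laptop scraped from Amazon
df = pd.read_json('AmazonLaptopReviews.json')
print(df.shape)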




Data Preparation and Text pre-processing

In the appendices, we have included the Python code used for data preparation and text pre-processing. Below, we explain step by step the process followed to prepare the data.


Creating new columns to facilitate handling of customer reviews and tech details:

After loading the data from the JSON file, every "review" entry is a dictionary type value that is composed of several fields: customer name, rating, date, title, and the text of the review itself.


Here we extract the relevant details (title and the text of the review itself) and create 3 new columns to facilitate the handling of the "review" entries. We create the following columns: "reviews_title", "reviews_text" and "reviews_one_string":


.
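
A sketch of this step, assuming each entry of the "reviews" column is a list of dictionaries with "title" and "review" fields (the field names here are illustrative):

# One list of titles and one list of review texts per laptop
df['reviews_title'] = df['reviews'].apply(lambda revs: [r['title'] for r in revs])
df['reviews_text']  = df['reviews'].apply(lambda revs: [r['review'] for r in revs])

# All of a laptop's review texts joined into a single string, which simplifies
# the text pre-processing and word counts performed later
df['reviews_one_string'] = df['reviews_text'].apply(' '.join)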


After loading the data from the JSON file, all technical details are in a dictionary type entry. In the following block, we are extracting the tech details that are important for our analysis ("series" and "model_number") and creating new columns for each of these relevant tech details:


.
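
A sketch of this step; the dictionary keys are assumed to follow the labels of Amazon's product information tables and may need adjusting:

# Promote the relevant tech details to their own columns
df['series']       = df['tech_details'].apply(lambda d: d.get('Series'))
df['model_number'] = df['tech_details'].apply(lambda d: d.get('Item model number'))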



Modifying the format and data type of some columns:

Here we make sure that the first character of the brand name is uppercase and the remaining characters are lowercase. This is important because we will perform filtering and searching functions using the brand name, so we need to make sure the spelling is consistent:


.
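
A one-line sketch, assuming the brand name is stored in a "brand" column:

# "LENOVO" / "lenovo" -> "Lenovo"
df['brand'] = df['brand'].str.capitalize()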


After extracting the data from the web page, the numeric values ("average_customer_reviews" and "price") are actually of "string" type. So, we need to convert these entries to a numeric type (float). This is necessary because we will perform mathematical operations with these values.

The following function takes a numeric string (<class 'str'>), removes any comma or dollar characters (",", "$") and returns a numeric float value (<class 'float'>):

.
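
A sketch of such a function:

def format_cleaner(value):
    """Remove any ',' and '$' characters from a numeric string and return a float."""
    return float(value.replace(',', '').replace('$', ''))

# Example: format_cleaner('$1,689.90') returns 1689.90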


A raw "average_customer_reviews" entry looks like this: "4.5 out of 5 stars" (<class 'str>')

We only need the first value as a numeric float type: 4.5 (<class 'float'>)

This is done in the next line of code over the entire dataframe by selecting only the first element ("4.5" in the above example) and applying the "format_cleaner()" function to the "average_customer_reviews" column:

.

A raw "price" entry looks like this: "$689.90" (<class 'str'>) We only need the numeric value: 689.90 (<class 'float'>) This is done in the next line of code over the entire dataframe by applying the "format_cleaner()" function to the "price" column:

.
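
A sketch of these two conversions, using the format_cleaner() function defined above:

# "4.5 out of 5 stars" -> keep only the first token ("4.5"), then convert to float
df['average_customer_reviews'] = (df['average_customer_reviews']
                                  .str.split().str[0]
                                  .apply(format_cleaner))

# "$689.90" -> 689.90
df['price'] = df['price'].apply(format_cleaner)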



Removing punctuation and stopwords:

  • Punctuation: We will remove all punctuation characters defined in the Python string library.
  • Our stopwords will be composed of:
    • The common stopwords defined in the nltk library
    • Some particular stopwords related to our data:
      • Brand names: There is no point in analyzing brand names. For instance, in a Lenovo review, the customer will use the word "Lenovo" many times, but this fact does not contribute anything to the analysis.
      • Laptop synonyms: laptop, computer, machine, etc.
      • Some unofficial contractions that are not in the nltk library: Im, dont, Ive, etc.



Defining our stopwords list:

.
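
A sketch of the stopwords definition; the brand names and extra words below just illustrate the categories described above:

from nltk.corpus import stopwords   # requires a one-off nltk.download('stopwords')

our_stopwords = set(stopwords.words('english'))
our_stopwords.update(['lenovo', 'hp', 'acer', 'dell',
                      'samsung', 'asus', 'toshiba'])      # brand names
our_stopwords.update(['laptop', 'computer', 'machine'])   # laptop synonyms
our_stopwords.update(['im', 'dont', 'ive'])               # unofficial contractions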


The following function takes a string and returns the same string without punctuation or stopwords:

.
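
A sketch of this function, reusing the our_stopwords set defined above:

import string

def pre_processing(text):
    """Return the input string without punctuation characters or stopwords."""
    text = ''.join(ch for ch in text if ch not in string.punctuation)
    return ' '.join(w for w in text.split() if w.lower() not in our_stopwords)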


Example of applying the "pre_processing()" function:

.


Here we apply the "pre_processing()" function to the "reviews_one_string" column over the entire dataframe:

.
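
In sketch form:

df['reviews_one_string'] = df['reviews_one_string'].apply(pre_processing)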


At this point, the data is ready for visualization at the front-end:

.



Data visualization

As we mentioned, version 1.0 has not yet been clearly separated into a three-tier architecture. At this point, everything runs on one server (Figure 3). However, the analysis has always been designed with the goal of deploying the application in a three-tier architecture (Figure 4).


Regarding chart building for data visualization, it has been particularly challenging to define the tier of the application architecture (back-end or front-end) in which chart building should be placed.


Although it is clear that the charts themselves are part of the front-end, there is usually a data processing step, linked to graphing, that could be placed in the back-end.


This is why we prefer to explain the programming details related to data visualization in this section. However, other data visualization aspects related to the same charts explained in this section will be treated in the Front-end section.



Avg. customer reviews & Avg. price bar charts

In this section we only discuss the code used to generate the charts. We explain why we included these charts in the application, and discuss other analytical aspects, in the Front-end section.


.
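
A sketch of how such a bar chart can be built with Plotly [27], using the columns prepared in the Data Analytics section:

import plotly.graph_objects as go

# Average customer rating per brand
avg_reviews_by_brand = df.groupby('brand')['average_customer_reviews'].mean()

fig = go.Figure(go.Bar(x=avg_reviews_by_brand.index,
                       y=avg_reviews_by_brand.values))
fig.update_layout(title='Avg. customer reviews by brand')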



Avg. customer reviews vs. Price bubble chart
.
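
A sketch of the bubble chart with Plotly Express [26]; the bubble size reflects how many reviews each laptop has:

import plotly.express as px

fig = px.scatter(df, x='price', y='average_customer_reviews',
                 size='number_reviews', color='brand', hover_name='series')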



Customer reviews word cloud
.
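
A sketch of this chart; we assume here the third-party wordcloud package, one common option for building word clouds in Python:

from wordcloud import WordCloud

# One big string containing every pre-processed review in the dataset
all_reviews = ' '.join(df['reviews_one_string'])

wordcloud = WordCloud(width=800, height=400,
                      background_color='white').generate(all_reviews)
wordcloud.to_file('reviews_wordcloud.png')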



Word count bar chart of customer reviews
.
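
A sketch of the word count chart:

from collections import Counter
import plotly.graph_objects as go

# The 20 most frequent words across all pre-processed reviews
word_counts = Counter(' '.join(df['reviews_one_string']).split()).most_common(20)
words, counts = zip(*word_counts)

fig = go.Figure(go.Bar(x=list(words), y=list(counts)))
fig.update_layout(title='Most frequent words in customer reviews')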



Sentiment analysis

We regret not having completed this feature in time to include it in this version. Although we have already started some tests, we do not yet have enough material to report progress here. We want to make it clear that we are currently working on this point.



Front-end

To develop the front-end we have used a Python framework called Dash. It is a relatively new open-source framework that is "built on top of Flask, Plotly.js, and React. It enables you to build dashboards using pure Python". [8]


According to its official documentation, "Dash is ideal for building data visualization apps. It's particularly suited for anyone who works with data in Python" [9]


Dash is designed to integrate plots built with the graphing library Plotly, an open-source library that allows us to make interactive, publication-quality graphs. So all the charts we integrate into our web application are built with Plotly [10]
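
To illustrate how Dash and Plotly fit together, here is a minimal sketch of a Dash app (not our actual application code; Dash 1.x-style imports, with invented figure values):

import dash
import dash_core_components as dcc
import dash_html_components as html
import plotly.express as px

app = dash.Dash(__name__)

# Any Plotly figure can be embedded in the layout through dcc.Graph
fig = px.bar(x=['Lenovo', 'Hp', 'Dell'], y=[4.2, 4.0, 4.1],
             labels={'x': 'Brand', 'y': 'Avg. customer reviews'})

app.layout = html.Div([
    html.H1('Amazon Laptops Dashboard'),
    dcc.Graph(id='avg-reviews-chart', figure=fig),
])

server = app.server   # WSGI entry point, served in production with e.g. `gunicorn app:server`

if __name__ == '__main__':
    app.run_server(debug=True)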


Basically, we have followed the official documentation of both technologies, Dash [9] and Plotly [10], to gain the necessary technical skills to carry out this project.


We think there is little point in explaining all the code we wrote for this part. The code is long, and we consider it unnecessary to spend many pages on a line-by-line technical explanation. It is more important to explain the rationale behind the decisions we made when defining the design and the elements used in the development process.


In the figures below we show some images of the application front-end.


You can also access the source code and download it from our Github repository at http://github.com/adeloaleman/AmazonLaptopsDashboard


In the following sections, we describe the main features and elements of the front-end development process.



Some main features about the front-end design


Home page


Brand selection, Series selection and Price panels


Avg. customer reviews vs. Prices panel


Customer reviews Word cloud and Word count bar chart


Modules under construction


Cloud Deployment


Figure 8: Cloud design for high availability and security. This is the planned final deployment of the Dashboard


Figure 9: Cloud design of a Three-tier architecture for high availability.




Conclusion


References


  1. 1.0 1.1 "What is a data dashboard?". Klipfolio.
  2. 2.0 2.1 2.2 "Faction A - Sentiment Analysis Dashboard". Microsoft.
  3. 3.0 3.1 3.2 "Shards dashboard - Demo". Designrevision.
  4. 4.0 4.1 "Web scraping". Wikipedia.
  5. 5.0 5.1 "Scrapy". Wikipedia.
  6. 6.0 6.1 "Official Scrapy website". Scrapy.
  7. 7.0 7.1 "Scrapy 2.1 documentation". Scrapy documentation.
  8. 8.0 8.1 "Dash for Beginners". Datacamp. 2019.
  9. 9.0 9.1 "Introduction to Dash". Dash documentation.
  10. 10.0 10.1 "Plotly Python Open Source Graphing Library". Plotly documentation.
  11. Ellingwood, Justin (May 2016). "How To Serve Flask Applications with Gunicorn and Nginx on Ubuntu 16.04". Digitalocean.
  12. "Failed to find application object 'server' in 'app'". Plotly community. 2018.
  13. Chesneau, Benoit. "Deploying Gunicorn". Gunicorn documentation.
  14. "Standalone WSGI Containers". Pallets documentation.
  15. "Deploying Dash Apps". Dash documentation.
  16. "Dropdown Examples and Reference". Dash documentation.
  17. Ka Hou, Sio (2019). "Bootstrap Navigation Bar". towardsdatascience.com.
  18. "4 Áreas con importantes datos por analizar". UCOM - Universidad comunera. 2019.
  19. "Bootstrap Navigation Bar". W3schools documentation.
  20. Carol Britton and Jill Doake (2005). "A Student Guide to Object-Oriented Development".
  21. "Create Charts & Diagrams Online". Lucidchart.
  22. "Upcube dashboard - Demo". themesdesign.
  23. "W3.CSS Sidebar". W3schools documentation.
  24. "Sidebar code example". W3schools documentation.
  25. "Navbar". Bootstrap documentation.
  26. "Bubble Charts in Python". Plotly documentation.
  27. "Bar Charts in Python". Plotly documentation.