4/28/2015

Ask HN: How do you log application events?

We are currently inserting our logs into an SQL database, with timestamp, logType, userId, userAgent and description columns. It makes it trivial for us to debug any event by just querying the DB. However, after three and a half years of continued use, the table is now way too large.
How do you guys log application events in such a way that extracting information from it is easy, but still keep the size of the logs manageable?



Ehm, the contrast between my answer and everyone else's here makes me feel surprisingly greybearded, but... Application logging has been a solved problem for decades now. syslog or direct-to-disk in a reasonable format, let logrotate do the job it's faithfully done for years and let the gzipped old files get picked up by the offsite backups that you're surely running, and use the standard collection of tools for mining text files: grep, cut, tail, etc.
I'm a little weirded out that "my logs are too big" is still a thing, and that the most common answer to this is "glue even more complexity together".


100% agree. Don't try to build your own when there are some excellent free (and commercial) ones that are battle-tested.
grep, cut, tail, etc. work quite well if you're working on a single machine or small number of machines.
The "ELK stack" (ElasticSearch, Logstash, Kibana) is a step up in complexity but gives you much more power than command line tools.
There are also some great commercial solutions that abstract away some of that complexity if you don't feel like rolling your own (Scalyr, Splunk, SumoLogic, etc.).
But regardless of the path you take, don't reinvent the wheel!
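As a concrete illustration of the syslog-and-standard-tools approach suggested above, here is a minimal Python sketch; the logger name, facility, and message format are arbitrary choices, not anything prescribed in the thread:

import logging
import logging.handlers

logger = logging.getLogger('myapp')  # arbitrary logger name
logger.setLevel(logging.INFO)

# Hand records to the local syslog daemon; syslog and logrotate then take
# care of rotation, compression, and archival as described above.
handler = logging.handlers.SysLogHandler(address='/dev/log')
handler.setFormatter(logging.Formatter(
    'myapp[%(process)d]: %(levelname)s %(message)s'))
logger.addHandler(handler)

logger.info('login userId=%s userAgent=%s', 42, 'Mozilla/5.0')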

FIFA 2014 World Cup live stream architecture




[Diagram: live_stream_nginx]

We were given the task to stream the 2014 FIFA World Cup and I think this was an experience worth sharing. This is a quick overview of the architecture, the components, the pain, the learning, the open source, etc.

The numbers

  • GER 7×1 BRA (yeah, we’re not proud of it)
  • 0.5M simultaneous users @ a single game – ARG x SUI
  • 580Gbps @ a single game – ARG x SUI
  • =~ 1600 watched years @ the whole event

The core overview

The project was to receive an input stream, generate an HLS output stream for hundreds of thousands of simultaneous users, and provide a great experience for end users:
  1. Fetch the RTMP input stream
  2. Generate HLS and send it to Cassandra
  3. Fetch binary and meta data from Cassandra and rebuild the HLS playlists with Nginx+lua
  4. Serve and cache the live content in a scalable way
  5. Design and implement the player
If you want to understand why we chose HLS, check this presentation (pt-BR only). tip: sometimes we need to rebuild some things from scratch.

The input

The live stream comes to our servers as RTMP, and we were using EvoStream (we're now moving to nginx-rtmp) to receive this input and generate HLS output to a known folder. Then we have some Python daemons, running on the same machine, watching this folder, parsing the m3u8 playlists and posting the data to Cassandra.
To watch for file modifications and be notified of these events, we first tried watchdog, but for some reason we weren't able to make it work as fast as we expected, so we switched to pyinotify.
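For illustration, a stripped-down sketch of that kind of watcher with pyinotify; the watched path and the Cassandra call are placeholders, not our actual code:

import pyinotify

WATCH_DIR = '/var/evostream/hls'  # placeholder for the known HLS output folder

class PlaylistHandler(pyinotify.ProcessEvent):
    def process_IN_CLOSE_WRITE(self, event):
        if event.pathname.endswith('.m3u8'):
            # parse the playlist and post chunk/meta data to Cassandra (omitted)
            print 'playlist updated:', event.pathname

    def process_IN_MOVED_TO(self, event):
        self.process_IN_CLOSE_WRITE(event)

wm = pyinotify.WatchManager()
mask = pyinotify.IN_CLOSE_WRITE | pyinotify.IN_MOVED_TO
wm.add_watch(WATCH_DIR, mask)
pyinotify.Notifier(wm, PlaylistHandler()).loop()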
Another challenge we had to overcome was making the Python program scale across CPU cores; we ended up creating multiple Python processes and using async execution.
tip: maybe the best language / tool is in another castle.

The storage

We were previously using Redis to store the live stream data, but we thought Cassandra was needed to offer DVR functionality easily (although we still use Redis a lot). Cassandra's response time increased with load to the point where clients started to time out and video playback stopped completely.
We were using it as a queue, which turns out to be an anti-pattern. We then denormalized our data, changed to LeveledCompactionStrategy, and set durable_writes to false, since we could treat our live stream as ephemeral data.
Finally, and most importantly, since we knew the maximum size a playlist could have, we could specify the start column (filtering with id > minTimeuuid(now - playlist_duration)). This really mitigated the effect of tombstones on reads. After these changes, we were able to achieve a latency on the order of 10ms at the 99th percentile.
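In Python with the cassandra-driver, that kind of bounded read looks roughly like this; the keyspace, table, and column names are invented for the example (our production reads actually go through the Lua driver described below):

from datetime import datetime, timedelta
from cassandra.cluster import Cluster

session = Cluster(['10.0.0.1']).connect('livestream')  # placeholder host/keyspace

playlist_window = timedelta(seconds=30)  # the known maximum playlist duration
start = datetime.utcnow() - playlist_window

# Only columns newer than (now - playlist_duration) are scanned, so expired,
# tombstoned chunks never slow the read down.
rows = session.execute(
    "SELECT id, chunk FROM hls_chunks "
    "WHERE stream_id = %s AND id > minTimeuuid(%s)",
    ("arg_x_sui", start))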
tip: limit your queries + denormalize your data + send instrumentation data to graphite + use SSD.

The output

With all the data and metadata we could build the HLS manifest and serve the video chunks. The only thing we struggled with was that we didn't want to add an extra server just to fetch and build the manifests.
Since we had already invested a lot of effort in Nginx+Lua, we thought it would be possible to use Lua to fetch and build the manifests. It was a matter of building a Lua driver for Cassandra and using it. One good thing about this approach (rebuilding the manifest) was that in the end we realized we were almost ready to serve DASH as well.
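Our manifest rebuilding lives in Lua inside Nginx, but the idea is simple enough to sketch in Python: given the chunk metadata fetched from Cassandra, emit a standard HLS media playlist (the field names here are illustrative):

def build_playlist(segments, target_duration=10):
    # segments: ordered dicts with 'seq', 'duration' and 'uri' built from Cassandra rows
    lines = [
        '#EXTM3U',
        '#EXT-X-VERSION:3',
        '#EXT-X-TARGETDURATION:%d' % target_duration,
        '#EXT-X-MEDIA-SEQUENCE:%d' % segments[0]['seq'],
    ]
    for seg in segments:
        lines.append('#EXTINF:%.3f,' % seg['duration'])
        lines.append(seg['uri'])
    return '\n'.join(lines) + '\n'

# e.g. build_playlist([{'seq': 100, 'duration': 9.96, 'uri': 'chunk-100.ts'}, ...])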
tip: test your lua scripts + check the lua global vars + double check your caching config

The player

In order to provide a better experience, we chose to build Clappr, an extensible open-source HTML5 video player. With Clappr – and a few custom extensions like PiP (Picture In Picture) and Multi-angle replays – we were able to deliver a great experience to our users.
tip: open source it from day 0 + follow the flow: issue -> commit FIX#123

The sauron

To keep an eye on all these systems, we built a monitoring dashboard using mostly open source projects like logstash, Elasticsearch, graphite, Grafana, Kibana, seyren, angular, mongo, redis, rails and many others.
tip: use SSD for graphite and elasticsearch

The bonus round

Although we didn't open source the entire solution, you can check out most of the components.
Source: http://leandromoreira.com.br/2015/04/26/fifa-2014-world-cup-live-stream-architecture/

1/28/2015

DIY Device Cloud

  • Motivation and Goals

    What is a device cloud?
    I consider a device cloud to be a service that provides both a broker for communication between devices and a host for applications that interact with devices. The broker allows devices to communicate to other devices or applications without having to be tightly coupled or directly connected to each other. The device cloud is also an easily accessible host for applications that interact with devices, such as visualizations of data or control of device functionality.
    Why do you want a device cloud?
    • Discovering and connecting to your devices over the internet is hard! Dynamic IP addresses, NAT, firewalls, and more prevent you from easily accessing your home network and devices. Hosting your own device cloud on the internet allows devices to connect and communicate without the trouble of opening connectivity to your home network.
    • Not all devices can be connected and online all the time. A device cloud allows for asynchronous communication between devices. For example a sensor or actuator can periodically connect to your device cloud to record data or receive new instructions; because the device isn't connected all the time it can greatly reduce its power consumption.
    • You have total control over the data and services available to your devices. You aren't limited by what a 3rd party provides or concerned about what they might do with your data. You own the entire infrastructure and can mold it to your needs.
    Project Goals
    • Automate the setup and maintenance of services that form a personal device cloud. Setup should be an easily repeatable process from a single command or script.
    • Target running the device cloud on cheap, low end cloud infrastructure like Amazon EC2's micro instance free tier.
    • Build on existing tools, services, and protocols--don't reinvent the wheel! Many of the services needed to build a device cloud exist today; they just need to be put together into an easy-to-use package.
    • Document how to connect popular development hardware like the Raspberry Pi, Beaglebone Black, and Arduino to a device cloud.
  • MQTT - Broker That Ties Everything Together

    MQTT is a publish-subscribe protocol that's perfect as a broker for device to device communication.  Created by Andy Stanford-Clark and Arlen Nipper, it has been used in projects like Andy's own 'Twittering House'.  You can see more from Andy in this TEDx talk:
    Mosquitto is an open source MQTT broker implementation that's under active development, has an established reputation, and is well documented.  I've done some informal testing with Mosquitto and it seems like the perfect broker to use in this project.
  • Ansible - Automation That's Easy

    Ansible is part of a new generation of open source IT automation tools which aim to simplify the automation of tasks like installing software and setting up servers.  You can see more about Ansible from its creator in this recent PyCon talk:
    Ansible strives to be simple and requires very little support infrastructure or complex setup.  Even though the project is only a few years old, it is already one of the largest on github and home to a healthy ecosystem of modules that can automate almost any server task.
    I've been investigating Ansible and expect to make heavy use of it as the primary means of automating the setup and maintenance of the device cloud.  The primary output of this project will likely be a collection of Ansible playbooks and tasks that help you build and manage your own device cloud.
  • Architecture Sketch

    Here's a high level look at how I see different components fitting together to form a device cloud:

    The MQTT broker, mosquitto, is at the center of all communication and lives on a server in Amazon EC2, Microsoft Azure, etc.  Along with the broker other applications like visualizations or control apps are hosted on the same server and can communicate with the broker directly.  I have some early thoughts on these apps and will save that for a followup post.
    Devices in your home network, or really anywhere, connect to the MQTT broker using an encrypted SSL MQTT channel.  Normally SSL is a tremendous pain to configure and manage (especially with self-signed certificates); however, I've found it's easy to automate the certificate creation tasks using Ansible, so it should be painless.  I want the device cloud to be secure by default and not require someone to be an expert in security to set up or run the system.
    Devices can be any hardware that's powerful enough to speak MQTT's protocol.  Embedded Linux boards like the Raspberry Pi, Beaglebone Black, Arduino Yun, Intel Galileo, etc. will be very easy to configure and use with the device cloud.  I hope to even provide Ansible tasks that can deploy and configure MQTT client tools and libraries on any embedded Linux device automatically.
    What about devices which don't support SSL, like an Arduino & CC3000?  In this case mosquitto has a concept of a bridge server which can act both as an MQTT broker and client.  Any MQTT messages sent to the bridge will be relayed back up to the parent broker and vice-versa.  With this setup it will be possible to build a small, low power Arduino + CC3000 device that periodically connects to the bridge over an unencrypted channel to send/receive commands.  Because the bridge lives inside your home network (which is secured with wireless encryption, right?) it's still relatively secure.  Certainly more secure than opening the broker in the cloud up to unencrypted MQTT traffic!  Again Ansible tasks should be able to turn any Linux gadget into a bridge for the device cloud easily.
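    To give a feel for the device side, here is a rough sketch of a client written with the Paho Python MQTT library; the broker hostname, topic, and certificate paths are illustrative placeholders, not project conventions:

    import paho.mqtt.client as mqtt

    def on_connect(client, userdata, flags, rc):
        client.subscribe('home/sensors/commands')  # illustrative topic

    def on_message(client, userdata, msg):
        print '%s -> %s' % (msg.topic, msg.payload)

    client = mqtt.Client()
    client.on_connect = on_connect
    client.on_message = on_message
    # The certificates would be the ones generated by the Ansible playbooks;
    # the paths below are placeholders.
    client.tls_set(ca_certs='/etc/mqtt/ca.crt',
                   certfile='/etc/mqtt/device.crt',
                   keyfile='/etc/mqtt/device.key')
    client.connect('broker.example.com', 8883, 60)  # placeholder hostname, MQTT-over-TLS port
    client.loop_forever()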
  • Github Home & Documentation

    I've started a repository on github that will be the home for all the software output from this project.  Since documentation will be an important part of this project I've also created a base for the documentation at http://diy-device-cloud.readthedocs.org/ using the excellent readthedocs.org service.  The documentation is actually built from files in the github repository for the project (under the docs folder) based on the sphinx tool.  Because the docs are in the repository it's easy to keep them versioned and up to date.  For now I've put up a couple docs that recap what I've covered so far in the project logs here.
  • MQTT Broker Setup On Amazon EC2

    In the past couple days I've added playbooks to the project github source which allow you to provision an MQTT broker server on Amazon's EC2 cloud and set up a basic device cloud.  Check out this video to see a walkthrough of setting up a server and communicating with a Raspberry Pi: 
    I'm still a little new to using Ansible, but so far I'm really happy with how simple it's making the process of standing up a server from scratch.
    Also I purchased the 'diydevicecloud.net' domain name and am making subdomains of it available to anyone who wants to host their device cloud from an easily reachable address with the excellent FreeDNS.afraid.org dynamic DNS service.  You can find the domain here and just need a free account on FreeDNS to get a subdomain.  I've even integrated into the Ansible playbook a step to register a regular cron job that will keep your broker server IP address up to date with your FreeDNS domain.  Watch the video to see more details of the process.
    You aren't limited to using this domain either--there are many other domains on FreeDNS, or you can even buy your own domain and get a static IP in Amazon's cloud.
    The next area I want to focus on is client apps.  As a part of this project I want to document a few end-to-end examples of MQTT applications, such as measuring sensor data or remotely controlling devices.
  • Server Deployment Updates

    I've done some cleaning up and refactored the Ansible playbooks to use roles as suggested by Ansible's best practices.  I'm not super happy with how the system is configured--right now global configuration is in the inventory file under the [all:vars] group.  This is a good spot to configure the entire system in one file but feels a little unwieldy.  I'll be looking more into other options here to see if there's a simpler and more flexible way to configure the system.
    I've also added quite a few roles to form the base of the server, including:
    • MQTT server: installs mosquitto server and configures it to use SSL secured MQTT communication.
    • MQTT client: installs mosquitto client tools, Paho MQTT python client library, and the SSL certs for communication with the server.  This role can be installed on both the Ubuntu-based server and Debian-based devices like the Raspberry Pi or Beaglebone Black.
    • Security: installs and configures fail2ban brute force attack protection package.  Ubuntu server is also pretty well locked down by only allowing SSH login with keys (i.e. no password auth allowed), and denying root the ability to login.  I want to look more into other useful security software / configs, but am pretty happy with security for now.
    • Web server: installs nginx web server.  There's not much configured right now, but I plan to put some basic web apps on to help manage the server & communicate with devices.
    • SMTP email relay: installs postfix mail transport and configures it to relay mail to an SMTP server like gmail.  This is handy for allowing the server to send emails, like from fail2ban warnings or perhaps notifications of MQTT events.  Surprisingly this was by far the most complex package to automate with Ansible, mostly because installing postfix requires answering some question prompts which are difficult to do with automation.
    One more major package I want to install is a Python WSGI app server that can host web applications written in Python.  I plan to make a few simple web apps for reading and writing MQTT messages and would prefer to write them in Python with the Flask web app framework.  I've looked into some options here and it looks like it comes down to either uWSGI or Gunicorn.  From my testing uWSGI is kind of painful to setup so I might go for Gunicorn.
  • Service Architecture

    Here's a deeper look at the services for the DIY device cloud:

    This diagram shows a simple device cloud 'hello world' application that turns an LED on from a web page over the internet.  You can see in the bottom left a web page which is hosted on the broker server.  When a button is clicked (1) it communicates to a python web application which is hosted by gunicorn (python WSGI web app server) and nginx (main web server).  That web application talks directly to the MQTT broker on the server (2), which will inform any connected clients of MQTT messages.  A device runs a device service which is just a simple python script that talks to the broker (3) and turns on/off the LED with GPIO libraries (4).  The device service is hosted by the excellent supervisor tool so it will automatically run in the background on the device--all the messy details about properly running a background process are abstracted away by supervisor.
    This diagram highlights the two main extension points I see where you can plug in your own code to build your device cloud:
    • Web services, written as python WSGI applications (using a simple framework like flask).
    • Device services, written as simple python scripts that are run automatically by supervisor.
    The ansible deployment scripts for both the broker server and devices take care of setting up all the services, configuring them, etc.  You only need to place your web service and device service code in a certain location and follow a few simple conventions.
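    As a rough sketch of the web-service half of that picture, a tiny Flask app that publishes an MQTT message when its endpoint is hit could look like this (the route and topic names are illustrative, not conventions of the project):

    from flask import Flask
    import paho.mqtt.publish as publish

    app = Flask(__name__)

    @app.route('/led/<state>', methods=['POST'])
    def set_led(state):
        # One-shot publish to the mosquitto broker on the same server; the
        # device service subscribed to this topic does the actual GPIO call.
        publish.single('devices/demo/led', payload=state, hostname='localhost')
        return 'ok\n'

    Gunicorn would be pointed at this app, and the matching device service is just a small paho-mqtt subscriber plus a GPIO call, run in the background by supervisor as described above.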
    I made a video of creating this simple hello world application and will post it shortly when Youtube is finished processing it.
  • Device Cloud Hello World

    This is a video of how to create a 'hello world' for the device cloud--blinking an LED from a webpage over the internet:

    Source: http://hackaday.io/project/1109/logs

How To Use Celery with RabbitMQ to Queue Tasks on an Ubuntu VPS

Introduction

Asynchronous, or non-blocking, processing is a method of separating the execution of certain tasks from the main flow of a program. This provides you with several advantages, including allowing your user-facing code to run without interruption.
Message passing is a method which program components can use to communicate and exchange information. It can be implemented synchronously or asynchronously and can allow discrete processes to communicate without problems. Message passing is often implemented as an alternative to traditional databases for this type of usage because message queues often implement additional features, provide increased performance, and can reside completely in-memory.
Celery is a task queue that is built on an asynchronous message passing system. It can be used as a bucket where programming tasks can be dumped. The program that passed the task can continue to execute and function responsively, and then later on, it can poll celery to see if the computation is complete and retrieve the data.
While celery is written in Python, its protocol can be implemented in any language. It can even function with other languages through webhooks.
By implementing a job queue into your program's environment, you can easily offload tasks and continue to handle interactions from your users. This is a simple way to increase the responsiveness of your applications and not get locked up while performing long-running computations.
In this guide, we will install and implement a celery job queue using RabbitMQ as the messaging system on an Ubuntu 12.04 VPS.

Install the Components

Install Celery

Celery is written in Python, and as such, it is easy to install in the same way that we handle regular Python packages.
We will follow the recommended procedures for handling Python packages by creating a virtual environment to install our messaging system. This helps us keep our environment stable and not affect the larger system.
Install the Python virtual environment package from Ubuntu's default repositories:
sudo apt-get update
sudo apt-get install python-virtualenv
We will create a messaging directory where we will implement our system:
mkdir ~/messaging
cd ~/messaging
We can now create a virtual environment where we can install celery by using the following command:
virtualenv --no-site-packages venv
With the virtual environment configured, we can activate it by typing:
source venv/bin/activate
Your prompt will change to reflect that you are now operating in the virtual environment we made above. This will ensure that our Python packages are installed locally instead of globally.
If at any time we need to deactivate the environment (not now), you can type:
deactivate
Now that we have activated the environment, we can install celery with pip:
pip install celery

Install RabbitMQ

Celery requires a messaging agent in order to handle requests from an external source. This agent is referred to as a "broker".
There are quite a few options for brokers available to choose from, including relational databases, NoSQL databases, key-value stores, and actual messaging systems.
We will be configuring celery to use the RabbitMQ messaging system, as it provides robust, stable performance and interacts well with celery. It is a great solution because it includes features that mesh well with our intended use.
We can install RabbitMQ through Ubuntu's repositories:
sudo apt-get install rabbitmq-server
The RabbitMQ service is started automatically on our server upon installation.

Create a Celery Instance

In order to use celery's task queuing capabilities, our first step after installation must be to create a celery instance. This is a simple process of importing the package, creating an "app", and then setting up the tasks that celery will be able to execute in the background.
Let's create a Python script inside our messaging directory called tasks.py where we can define tasks that our workers can perform.
sudo nano ~/messaging/tasks.py
The first thing we should do is import the Celery function from the celery package:
from celery import Celery
After that, we can create a celery application instance that connects to the default RabbitMQ service:
from celery import Celery

app = Celery('tasks', backend='amqp', broker='amqp://')
The first argument to the Celery function is the name that will be prepended to tasks to identify them.
The backend parameter is an optional parameter that is necessary if you wish to query the status of a background task, or retrieve its results.
If your tasks are simply functions that do some work and then quit, without returning a useful value to use in your program, you can leave this parameter out. If only some of your tasks require this functionality, enable it here and we can disable it on a case-by-case basis further on.
The broker parameter specifies the URL needed to connect to our broker. In our case, this is the RabbitMQ service that is running on our server. RabbitMQ operates using a protocol called "amqp". If RabbitMQ is operating under its default configuration, celery can connect with no other information other than the amqp:// scheme.
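If RabbitMQ is running elsewhere, or you have created a dedicated user and vhost for celery, that information goes into the broker URL; the host, credentials, and vhost below are placeholders:
app = Celery('tasks', backend='amqp',
             broker='amqp://myuser:mypassword@localhost:5672/myvhost')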

Build Celery Tasks

Still in this file, we now need to add our tasks.
Each celery task must be introduced with the decorator @app.task. This allows celery to identify functions that it can add its queuing functions to. After each decorator, we simply create a function that our workers can run.
Our first task will be a simple function that prints out a string to console.
from celery import Celery

app = Celery('tasks', backend='amqp', broker='amqp://')

@app.task
def print_hello():
    print 'hello there'
Because this function does not return any useful information (it instead prints it to the console), we can tell celery to not use the backend to store state information about this task. This is less complicated under the hood and requires fewer resources.
from celery import Celery

app = Celery('tasks', backend='amqp', broker='amqp://')

@app.task(ignore_result=True)
def print_hello():
    print 'hello there'
Next, we will add another function that will generate prime numbers (taken from RosettaCode). This can be a long-running process, so it is a good example for how we can deal with asynchronous worker processes when we are waiting for a result.
from celery import Celery

app = Celery('tasks', backend='amqp', broker='amqp://')

@app.task(ignore_result=True)
def print_hello():
    print 'hello there'

@app.task
def gen_prime(x):
    multiples = []
    results = []
    for i in xrange(2, x+1):
        if i not in multiples:
            results.append(i)
            for j in xrange(i*i, x+1, i):
                multiples.append(j)
    return results
Because we care about what the return value of this function is, and because we want to know when it has completed (so that we may use the results, etc), we do not add the ignore_result parameter to this second task.
Save and close the file.

Start Celery Worker Processes

We can now start a worker process that will be able to accept connections from applications. It will use the file we just created to learn about the tasks it can perform.
Starting a worker instance is as easy as calling out the application name with the celery command. We will include a "&" character at the end of our command to put our worker process in the background:
celery worker -A tasks &
This will start up an application, and then detach it from the terminal, allowing you to continue to use it for other tasks.
If you want to start multiple workers, you can do so by naming each one with the -n argument:
celery worker -A tasks -n one.%h &
celery worker -A tasks -n two.%h &
The %h will be replaced by the hostname when the worker is named.
To stop workers, you can use the kill command. We can query for the process id and then eliminate the workers based on this information.
ps auxww | grep 'celery worker' | awk '{print $2}' | xargs kill
This will allow the worker to complete its current task before exiting.
If you wish to shut down all workers without waiting for them to complete their tasks, you can execute:
ps auxww | grep 'celery worker' | awk '{print $2}' | xargs kill -9

Use the Queue to Handle Work

We can use the worker process(es) we spawned to complete work in the background for our programs.
Instead of creating an entire program to demonstrate how this works, we will explore the different options in a Python interpreter:
python
At the prompt, we can import our functions into the environment:
from tasks import print_hello
from tasks import gen_prime
If you test these functions, they appear to not have any special functionality. The first function prints a line as expected:
print_hello()
hello there
The second function returns a list of prime numbers:
primes = gen_prime(1000)
print primes
If we give the second function a larger range of numbers to check, the execution hangs while it calculates:
primes = gen_prime(50000)
Stop the execution by typing "CTRL-C". This process is clearly not computing in the background.
To access the background worker, we need to use the .delay method. Celery wraps our functions with additional capabilities. This method is used to pass the function to a worker to execute. It should return immediately:
primes = gen_prime.delay(50000)
This task is now being executed by the workers we started earlier. Because we configured a backend parameter for our application, we can check the status of the computation and get access to the result.
To check whether the task is complete, we can use the .ready method:
primes.ready()
False
A value of "False" means that the task is still running and a result is not available yet. When we get a value of "True", we can do something with the answer.
primes.ready()
True
We can get the value by using the .get method.
If we have already verified that the value is computed with the .ready method, then we can use that method like this:
print primes.get()
[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 
83, 89, 97, 101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151, 157, 163, 167, 
173, 179, 181, 191, 193, 197, 199, 211, 223, 227, 229, 233, 239, 241, 251, 257, 263, 
269, 271, 277, 281, 283, 293, 307, 311, 313, 317, 331, 337, 347, 349, 353, 359, 367, 
373, 379, 383, 389, 397, 401, 409, 419, 421, 431, 433, 439, 443, 449, 457, 461, 463, 
467, 479, 487, 491, 499, 503, 509, 521, 523, . . .
If, however, you have not used the .ready method prior to calling .get, you most likely want to add a "timeout" option so that your program isn't forced to wait for the result, which would defeat the purpose of our implementation:
print primes.get(timeout=2)
This will raise an exception if it times out, which you can handle in your program.
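A minimal sketch of that pattern (celery raises celery.exceptions.TimeoutError when .get gives up waiting):
from celery.exceptions import TimeoutError

primes = gen_prime.delay(50000)
try:
    print primes.get(timeout=2)
except TimeoutError:
    print 'Result not ready yet, try again later'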

Conclusion

Although this is enough information to get you started on using celery within your programs, it is only scratching the surface on the full functionality of this library. Celery allows you to string background tasks together, group tasks, and combine functions in interesting ways.
Although celery is written in Python, it can be used with other languages through webhooks. This makes it incredibly flexible for moving tasks into the background, regardless of your chosen language.

Source: https://www.digitalocean.com/community/tutorials/how-to-use-celery-with-rabbitmq-to-queue-tasks-on-an-ubuntu-vps

Creating a browser based EmonHub log console using python, flask, socketio, MQTT

Written in Python using the Flask web framework, this example subscribes to an MQTT topic to which logs are posted from emonhub and pushes messages up to the browser using socket.io. The results are displayed in an HTML box styled to look like an Ubuntu Linux terminal window. The example includes basic session-based authentication with a hardcoded username and password; see app.py.
Username: demo, password: demo
Being able to view the emonhub log from a browser rather than having to login via SSH could make debugging more convenient.
[Screenshot: pythonwebconsole.png]

Install

sudo pip install Flask
sudo pip install Flask-SocketIO
sudo pip install mosquitto
Installation of Flask-SocketIO can be slow on a Raspberry Pi.
see: https://flask-socketio.readthedocs.org/en/latest/
Code is based on the flask-socketio example
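The rough shape of the app is sketched below; this is a trimmed illustration only, not the actual app.py, which also adds the session-based login and the HTML console template:

import threading
import mosquitto
from flask import Flask
from flask_socketio import SocketIO

app = Flask(__name__)
app.config['SECRET_KEY'] = 'change-me'
socketio = SocketIO(app)

def on_message(mosq, obj, msg):
    # Push each emonhub log line straight to all connected browsers
    socketio.emit('log', {'line': msg.payload})

mqttc = mosquitto.Mosquitto()
mqttc.on_message = on_message
mqttc.connect("127.0.0.1", 1883, 60)
mqttc.subscribe("log", 0)

def mqtt_loop():
    while True:
        mqttc.loop()

if __name__ == '__main__':
    t = threading.Thread(target=mqtt_loop)
    t.daemon = True
    t.start()
    socketio.run(app)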

Changes to emonhub

In emonhub.py:
In the import section at the top add:
import mosquitto
Just below the import section add:
mqttc = mosquitto.Mosquitto()
mqttc.connect("127.0.0.1",1883, 60, True)

class MQTTLogHandler(logging.Handler):
    def emit(self, record):        
        mqttc.publish("log",record.asctime+" "+record.levelname+" "+record.message)
Lower down, just below the line (line 324):
logger.addHandler(loghandler)
add:
mqttloghandler = MQTTLogHandler()
logger.addHandler(mqttloghandler) 
 
Source: https://github.com/emoncms/development/tree/master/Tutorials/Python/socketiowebconsole

1/11/2015

Web Scraping With Scrapy and MongoDB



In this article we’re going to build a scraper for an actual freelance gig where the client wants a Python program to scrape data from Stack Overflow to grab new questions (question title and URL). Scraped data should then be stored in MongoDB. It’s worth noting that Stack Overflow has an API, which can be used to access the exact same data. However, the client wanted a scraper, so a scraper is what he got.

Updated on 01/03/2014 – refactored spider. Thanks, @kissgyorgy.

As always, be sure to review the site's terms of use/service and respect the robots.txt file before starting any scraping job. Make sure to adhere to ethical scraping practices by not flooding the site with numerous requests over a short span of time. Treat any site you scrape as if it were your own.

Installation

We need the Scrapy library (v0.24.4) along with PyMongo (v2.7.2) for storing the data in MongoDB. You need to install MongoDB as well (not covered).

Scrapy

If you’re running OSX or a flavor of Linux, install Scrapy with pip (with your virtualenv activated):
$ pip install Scrapy
If you are on a Windows machine, you will need to manually install a number of dependencies. Please refer to the official documentation for detailed instructions, as well as this YouTube video that I created.
Once Scrapy is set up, verify your installation by running this command in the Python shell:
>>> import scrapy
>>>
If you don’t get an error then you are good to go!

PyMongo

Next, install PyMongo with pip:
$ pip install pymongo
Now we can start building the crawler.

Scrapy Project

Let’s start a new Scrapy project:
$ scrapy startproject stack
This creates a number of files and folders that includes a basic boilerplate for you to get started quickly:
├── scrapy.cfg
└── stack
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

Specify Data

The items.py file is used to define storage “containers” for the data that we plan to scrape.
The StackItem() class inherits from Item (docs), which basically has a number of pre-defined objects that Scrapy has already built for us:
import scrapy


class StackItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
Let’s add some items that we actually want to collect. For each question the client needs the title and URL. So, update items.py like so:
from scrapy.item import Item, Field


class StackItem(Item):
    title = Field()
    url = Field()

Create the Spider

Create a file called stack_spider.py in the "spiders" directory. This is where the magic happens – i.e., where we'll tell Scrapy how to find the exact data we're looking for. As you can imagine, this is specific to each individual web page.
Start by defining a class that inherits from Scrapy’s Spider and then adding attributes as needed:
from scrapy import Spider


class StackSpider(Spider):
    name = "stack"
    allowed_domains = ["stackoverflow.com"]
    start_urls = [
        "http://stackoverflow.com/questions?pagesize=50&sort=newest",
    ]
The first few variables are self-explanatory (docs):
  • name defines the name of the Spider.
  • allowed_domains contains the base-URLs for the allowed domains for the spider to crawl.
  • start_urls is a list of URLs for the spider to start crawling from. All subsequent URLs will start from the data that the spider downloads from the URLs in start_urls.

XPath Selectors

Next, Scrapy uses XPath selectors to extract data from a website. In other words, we can select certain parts of the HTML data based on a given XPath. As stated in Scrapy’s documentation, “XPath is a language for selecting nodes in XML documents, which can also be used with HTML.”
You can easily find a specific XPath using Chrome's Developer Tools. Simply inspect a specific HTML element, copy the XPath, and then tweak it (as needed):

[Screenshot: copy XPath in Chrome Developer Tools]

Developer Tools also gives you the ability to test XPath selectors in the JavaScript Console by using $x – i.e., $x("//img"):

[Screenshot: test XPath in Chrome Developer Tools]

Again, we basically tell Scrapy where to start looking for information based on a defined XPath. Let’s navigate to the Stack Overflow site in Chrome and find the XPath selectors.
Right click on the first question and select “Inspect Element”:

[Screenshot: Inspect Element in Chrome Developer Tools]

Now grab the XPath for the <div class="summary">, //*[@id="question-summary-27624141"]/div[2], and then test it out in the JavaScript Console:

[Screenshot: test XPath in Chrome Developer Tools]

As you can tell, it just selects that one question. So we need to alter the XPath to grab all questions. Any ideas? It's simple: //div[@class="summary"]/h3. What does this mean? Essentially, this XPath states: Grab all <h3> elements that are children of a <div> that has a class of summary. Test this XPath out in the JavaScript Console.
Notice how we are not using the actual XPath output from Chrome Developer Tools. In most cases, the output is just a helpful aside, which generally points you in the right direction for finding the working XPath.
Now let’s update the stack_spider.py script:
from scrapy import Spider
from scrapy.selector import Selector


class StackSpider(Spider):
    name = "stack"
    allowed_domains = ["stackoverflow.com"]
    start_urls = [
        "http://stackoverflow.com/questions?pagesize=50&sort=newest",
    ]

    def parse(self, response):
        questions = Selector(response).xpath('//div[@class="summary"]/h3')

Extract the Data

We still need to parse and scrape the data we want, which falls within <div class="summary"><h3>. Again, update stack_spider.py like so:
from scrapy import Spider
from scrapy.selector import Selector

from stack.items import StackItem


class StackSpider(Spider):
    name = "stack"
    allowed_domains = ["stackoverflow.com"]
    start_urls = [
        "http://stackoverflow.com/questions?pagesize=50&sort=newest",
    ]

    def parse(self, response):
        questions = Selector(response).xpath('//div[@class="summary"]/h3')

        for question in questions:
            item = StackItem()
            item['title'] = question.xpath(
                'a[@class="question-hyperlink"]/text()').extract()[0]
            item['url'] = question.xpath(
                'a[@class="question-hyperlink"]/@href').extract()[0]
            yield item
We are iterating through the questions and assigning the title and url values from the scraped data. Be sure to test out the XPath selectors in the JavaScript Console within Chrome Developer Tools – e.g., $x('//div[@class="summary"]/h3/a[@class="question-hyperlink"]/text()') and $x('//div[@class="summary"]/h3/a[@class="question-hyperlink"]/@href').

Test

Ready for the first test? Simply run the following command within the “stack” directory:
$ scrapy crawl stack
Along with the Scrapy stack trace, you should see 50 question titles and URLs outputted. You can render the output to a JSON file with this little command:
$ scrapy crawl stack -o items.json -t json
We've now implemented our Spider based on the data we are seeking. Now we need to store the scraped data in MongoDB.

Store the Data in MongoDB

Each time an item is returned, we want to validate the data and then add it to a Mongo collection.
The initial step is to create the database that we plan to use to save all of our crawled data. Open settings.py and specify the pipeline and add the database settings:
ITEM_PIPELINES = ['stack.pipelines.MongoDBPipeline', ]

MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "stackoverflow"
MONGODB_COLLECTION = "questions"

Pipeline Management

We've set up our spider to crawl and parse the HTML, and we've set up our database settings. Now we have to connect the two together through a pipeline in pipelines.py.
Connect to Database
First, let’s define a method to actually connect to the database:
import pymongo

from scrapy.conf import settings


class MongoDBPipeline(object):

    def __init__(self):
        connection = pymongo.Connection(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]
Here, we create a class, MongoDBPipeline(), and we have a constructor function to initialize the class by defining the Mongo settings and then connecting to the database.
Process the Data
Next, we need to define a method to process the parsed data:
import pymongo

from scrapy.conf import settings
from scrapy.exceptions import DropItem
from scrapy import log


class MongoDBPipeline(object):

    def __init__(self):
        connection = pymongo.Connection(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        valid = True
        for data in item:
            if not data:
                valid = False
                raise DropItem("Missing {0}!".format(data))
        if valid:
            self.collection.insert(dict(item))
            log.msg("Question added to MongoDB database!",
                    level=log.DEBUG, spider=spider)
        return item
We establish a connection to the database, unpack the data, and then save it to the database. Now we can test again!

Test

Again, run the following command within the “stack” directory:
$ scrapy crawl stack
Hooray! We have successfully stored our crawled data into the database:

[Screenshot: Robomongo]
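If you prefer to verify from a Python shell instead of a GUI like Robomongo, a quick check with PyMongo (using the same database and collection names we put in settings.py) might look like this:

import pymongo

connection = pymongo.Connection('localhost', 27017)
db = connection['stackoverflow']
for question in db['questions'].find().limit(5):
    print question['title'], question['url']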

Conclusion

This is a pretty simple example of using Scrapy to crawl and scrape a web page. The actual freelance project required the script to follow the pagination links and scrape each page using the CrawlSpider (docs), which is super easy to implement. Try implementing this on your own, and leave a comment below with the link to the Github repository for a quick code review. Need help? Start with this script, which is nearly complete. Cheers!
You can download the entire source code from the Github repository. Comment below with questions. Thanks for Reading!