Using Tor in Python scripts

Figure 1: DNS leaks - borrowed from dnsleaktest.com

I was recently building a tool in Python that had to connect to the Internet over the Tor network. I learned some valuable lessons along the way and wanted to share them in this post. Because Tor exposes a local SOCKS5 proxy, it is quite easy to tunnel traffic over it. The catch is to prevent the local machine from performing the DNS queries itself, as those would leak important privacy information.


The whole process comes in three steps:



  1. Replace the standard socket object with the socks socket - downstream libraries will then use this socks socket as their base and gain SOCKS support

  2. Replace the socket getaddrinfo function to prevent DNS leakage

  3. Import any downstream modules that will use the customized sockets (like urllib or requests)


The code


The following listing shows the simple snippet which sets up the Python script to connect over Tor.


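#!/usr/bin/env python

# The socks module comes from a SOCKS client package such as PySocks.
import socks
import socket

# Step 1: route every new socket through the local Tor SOCKS5 proxy.
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9050)
socket.socket = socks.socksocket

# Step 2: skip local DNS resolution and hand the host name to the proxy.
def getaddrinfo(*args):
    # args[0] is the host name, args[1] is the port
    return [(socket.AF_INET, socket.SOCK_STREAM, 6, '', (args[0], args[1]))]

socket.getaddrinfo = getaddrinfo

# Step 3: import downstream modules only after the sockets are patched.
import requests

headers = {
    'User-agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
    'referer': 'https://www.google.com'
}

# print() with a single argument works on both Python 2 and Python 3.
print(requests.get('https://check.torproject.org/', headers=headers).text)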


Headers can be removed or changed if needed. Also, make sure to check the Tor port. A standalone Tor service listens for SOCKS connections on port 9050 by default, while the Tor instance bundled with Tor Browser listens on port 9150.
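If you are not sure which of the two ports is in use, a quick probe can tell you. The following is only a minimal sketch: the helper name find_open_port is made up for this post, it assumes the default ports mentioned above, and it only checks that something accepts TCP connections on the port, not that it really is Tor. Run it before socket.socket is replaced, otherwise the probe itself would go through the proxy.

```python
import socket

# Default SOCKS ports: 9050 for a standalone Tor service, 9150 for Tor Browser.
CANDIDATE_PORTS = (9050, 9150)

def find_open_port(host="127.0.0.1", ports=CANDIDATE_PORTS, timeout=1.0):
    """Return the first port in `ports` that accepts a TCP connection, or None."""
    for port in ports:
        try:
            # A successful connect only proves that something is listening there.
            with socket.create_connection((host, port), timeout=timeout):
                return port
        except OSError:
            # Refused or timed out: try the next candidate.
            continue
    return None
```

The returned value can then be passed to socks.setdefaultproxy in place of the hard-coded 9050.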


The getaddrinfo function


The getaddrinfo function retrieves the details required to establish a connection with a remote site, including the site's IP address. In most cases, this means sending a DNS request from the local machine. That request goes out directly instead of through Tor, so it exposes your real IP address, together with the host name you are looking up, to the DNS server. This subverts the privacy offered by Tor.


To handle this, the socket module's getaddrinfo function is monkeypatched with our own implementation, with one big difference: where the original function would return the IP address obtained from the DNS record, ours returns the original host name.

As the socket requires an IP address to connect successfully, the question arises: how does this work? The answer is that SOCKS5 proxies can resolve host names themselves. The host name is handed to the Tor proxy, which performs the lookup through the Tor network, so no DNS query leaves the local machine directly and the original IP address is never revealed.
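To make the idea concrete, here is a small sketch of such a replacement that can be tried on its own, without Tor or any SOCKS module installed. It is a variation on the snippet above, not a different technique: the name passthrough_getaddrinfo is made up for this post, and the explicit parameters mirror the standard socket.getaddrinfo signature so that keyword arguments are accepted as well.

```python
import socket

def passthrough_getaddrinfo(host, port, family=0, type=0, proto=0, flags=0):
    """Return a single address entry that still carries the unresolved host name."""
    # Same tuple layout as the real getaddrinfo:
    # (family, socktype, proto, canonname, sockaddr)
    return [(socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_TCP, "", (host, port))]

# No lookup happens here; the host name is simply passed through.
entry = passthrough_getaddrinfo("check.torproject.org", 443)[0]
print(entry[4])  # ('check.torproject.org', 443)
```

The last element of the tuple is what the socket later receives in connect(), which is why the SOCKS socket ends up with a host name instead of an IP address.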


Checking if our application works


The Tor Project has created a web page which can be queried to see if Tor is being used properly. The problem is that this page can't tell you if your application is leaking DNS data. The leaked DNS data leaves your machine and goes to your main DNS server, without ever reaching the Tor check page. That is why you have to verify for yourself that your application is not leaking that data.
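For scripts, reading the answer from JSON is easier than scanning the HTML page. The sketch below rests on an assumption that is not part of the snippet above: that the check service also offers a JSON endpoint at /api/ip which returns the fields IsTor and IP. Verify this against the live service before relying on it. The helper name parse_tor_check is made up for this post, and like the web page, this check says nothing about DNS leaks.

```python
import json

# Assumed endpoint and response shape, e.g. {"IsTor": true, "IP": "203.0.113.7"}.
CHECK_URL = "https://check.torproject.org/api/ip"

def parse_tor_check(body):
    """Return (is_tor, ip) from the JSON body of the check service."""
    data = json.loads(body)
    # Missing fields are treated as "not using Tor" and "IP unknown".
    return bool(data.get("IsTor", False)), data.get("IP")

# Sample body with a documentation-range IP address, not a real response.
sample = '{"IsTor": true, "IP": "203.0.113.7"}'
print(parse_tor_check(sample))  # (True, '203.0.113.7')
```

In the real script, the body would come from requests.get(CHECK_URL).text, called after the sockets have been patched.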


The best way to check this is to use Wireshark. When the getaddrinfo function is not monkeypatched, you should see a DNS query going out to your router's address (the router forwards the request if needed). When getaddrinfo is monkeypatched, you should see no DNS queries from your script at all.


Conclusion


This simple snippet will be useful if you need your tools to go through a SOCKS proxy or the Tor network. Just be careful about the leaks.

python, tor, SOCKS, tools