How to scrape Instagram with Python
This example shows how to get profile data from Instagram. It only works for public accounts; you can't get data from accounts that are restricted or require approval.
For this to work we need two imports:
import urllib.request
import json
We need urllib for making the HTTP requests, and json for reading the JSON.
We then write a function that parses the web page's HTML, so we can extract the account information:
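def extract_instagram_data(html):
    # Everything after the marker starts with the JSON object we want
    split1 = html.split('window._sharedData =')
    if len(split1) < 2:
        return None  # marker not found in the page
    # Cut at the first '"};' and put back the '"}' that split() removed
    split2 = split1[1].split('"};')
    data = json.loads(split2[0] + '"}')
    return data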
In the previous example I used BeautifulSoup, but this example is easier with raw HTML.
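The parsing function above relies on the JSON ending in '"};'. As an alternative, here is a minimal sketch that lets the json module find the end of the object by itself with JSONDecoder.raw_decode. The name extract_shared_data and the sample HTML are made up for illustration, and the sketch is only tried on that made-up fragment, not on a live page.

```python
import json

MARKER = 'window._sharedData ='


def extract_shared_data(html):
    """Return the JSON object assigned to window._sharedData, or None if absent."""
    _, found, rest = html.partition(MARKER)
    if not found:
        return None
    # raw_decode parses one JSON value from the start of the string and
    # ignores whatever follows it (here the trailing ';</script>').
    # It does not skip leading whitespace, hence the lstrip().
    data, _end = json.JSONDecoder().raw_decode(rest.lstrip())
    return data


# Made-up page fragment, only for illustration.
sample_html = '<script>window._sharedData = {"country_code": "US", "entry_data": {}};</script>'
print(extract_shared_data(sample_html))
print(extract_shared_data('<html>no data here</html>'))
```

The upside of this approach is that it does not depend on how the script tag ends; the downside is that it still breaks if the marker text itself changes.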
To get the HTML, we write a simple function that retrieves it:
def get_html(url):
    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    request = urllib.request.Request(url, headers={'User-Agent': user_agent})
    response = urllib.request.urlopen(request)
    html = response.read().decode('UTF-8')
    return html
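The function above raises an exception if the request fails. Here is a rough sketch of a variant that returns None instead, using the exception classes from urllib.error. The name get_html_or_none, the opener parameter and the fake_not_found helper are assumptions added for illustration; the opener parameter only exists so the error path can be tried without a network connection. Which status code Instagram sends for a given account is not something this sketch knows about.

```python
import urllib.error
import urllib.request

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'


def get_html_or_none(url, opener=urllib.request.urlopen, timeout=10):
    """Fetch a page and return its HTML, or None if the request fails."""
    request = urllib.request.Request(url, headers={'User-Agent': USER_AGENT})
    try:
        with opener(request, timeout=timeout) as response:
            return response.read().decode('UTF-8')
    except urllib.error.HTTPError as error:
        # HTTPError is a subclass of URLError, so it has to be caught first.
        print('Request failed with HTTP status', error.code)
    except urllib.error.URLError as error:
        print('Could not reach the server:', error.reason)
    return None


# Offline demonstration with a stand-in opener (hypothetical helper).
def fake_not_found(request, timeout=None):
    raise urllib.error.HTTPError(request.full_url, 404, 'Not Found', None, None)


print(get_html_or_none('https://www.instagram.com/some-missing-account/', opener=fake_not_found))
```

In real use you would call get_html_or_none(url) without the opener argument and check for None before parsing.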
Putting everything together is very easy and takes three lines of code.
For this example I have used my own Instagram account again:
url = 'https://www.instagram.com/mr.wiese/'
insta = extract_instagram_data(get_html(url))
print(insta)
Running this prints something like the following to the console:
{'config': {'csrf_token': 'xxxxxx', 'viewer': None, 'viewerId': None}, 'country_code': 'US', 'language_code': 'en', 'locale': 'en_US', 'entry_data': {'ProfilePage': [{......
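The result is an ordinary Python dictionary, so you can pick values out of it by key. Here is a small sketch that reads the top-level fields shown above and reaches the first ProfilePage entry without crashing if a key is missing. The insta dictionary below is a shortened stand-in for the real output, the ProfilePage entry is left empty because the output above is truncated, and first_profile_page is a hypothetical helper name.

```python
# Shortened stand-in for the dictionary printed above; the contents of the
# ProfilePage entry are left empty because the original output is truncated.
insta = {
    'config': {'csrf_token': 'xxxxxx', 'viewer': None, 'viewerId': None},
    'country_code': 'US',
    'language_code': 'en',
    'locale': 'en_US',
    'entry_data': {'ProfilePage': [{}]},
}


def first_profile_page(data):
    """Return the first ProfilePage entry, or None if the key path is missing."""
    pages = data.get('entry_data', {}).get('ProfilePage', [])
    return pages[0] if pages else None


profile = first_profile_page(insta)
print(insta['locale'])
# List the keys of the profile entry to see what is available to read next.
print(sorted(profile.keys()) if profile is not None else 'no profile page found')
```

Printing the keys of the real profile entry is a simple way to explore what the page actually contains before you rely on any particular field.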
Getting the account information is very simple and takes about 5 minutes from start to running. Getting all the pictures from the account is much trickier, but it can still be done within 10 minutes.
Here is the complete code:
# Import libraries
import urllib.request
import json

def extract_instagram_data(html):
    # Everything after the marker starts with the JSON object we want
    split1 = html.split('window._sharedData =')
    if len(split1) < 2:
        return None  # marker not found in the page
    # Cut at the first '"};' and put back the '"}' that split() removed
    split2 = split1[1].split('"};')
    data = json.loads(split2[0] + '"}')
    return data

def get_html(url):
    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    request = urllib.request.Request(url, headers={'User-Agent': user_agent})
    response = urllib.request.urlopen(request)
    html = response.read().decode('UTF-8')
    return html

url = 'https://www.instagram.com/mr.wiese/'
insta = extract_instagram_data(get_html(url))
print(insta)