Prior to starting a historical data migration, ensure you do the following:
- Create a project on our US or EU Cloud.
- Sign up to a paid product analytics plan on the billing page (historic imports are free but this unlocks the necessary features).
- Raise an in-app support request (Target Area: Data Management) detailing where you are sending events from, how, the total volume, and the speed. For example, "we are migrating 30M events from a self-hosted instance to EU Cloud using the migration scripts at 10k events per minute."
- Wait for the OK from our team before starting the migration process to ensure that it completes successfully and is not rate limited.
- Set the historical_migration option to true when capturing events in the migration.
Migrating data from Matomo is a two-step process:
- Exporting data via Matomo's API
- Converting Matomo event data to the PostHog schema and capturing in PostHog
1. Exporting data from Matomo
Matomo provides a full API you can get data from. To get the relevant event data, we make a request to the Live.getLastVisitsDetails method.
To access the API, you need to create an API token. This can be done in your instance's security settings.
With this token, you can make a request to get your event data and save it as a JSON file. This example gets 10,000 events from the first half of 2024:

    import requests
    import json

    # Connection details for your Matomo instance
    base_url = "https://cool.matomo.cloud/"
    endpoint = "?module=API&method=Live.getLastVisitsDetails"

    # Query parameters: site, date range, output format, and row limit
    site_id = "1"
    period = "range"
    date = "2024-01-01,2024-07-01"
    response_format = "JSON"
    token_auth = "11798cde5ksjadjslc3136dd09"
    filter_limit = "10000"

    url = f"{base_url}{endpoint}"
    payload = {
        "idSite": site_id,
        "period": period,
        "date": date,
        "format": response_format,
        "token_auth": token_auth,
        "filter_limit": filter_limit
    }

    # The token is sent in the POST body rather than in the URL
    response = requests.post(url, data=payload)

    if response.status_code == 200:
        print("Request was successful")
        events = response.json()
        # Save the exported visits for the conversion step
        with open('events.json', 'w') as json_file:
            json.dump(events, json_file, indent=4)
    else:
        print(response.text)
        print(f"Request failed with status code: {response.status_code}")

To return all rows, set filter_limit to -1. Matomo recommends you only export 10,000 rows at a time. If you have more than that, you can combine filter_limit and filter_offset to fetch the rows in sets.
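
One way to combine the two parameters is to request fixed-size pages until a short or empty page comes back. The sketch below is only an illustration: build_payload, iter_visits, and the fetch_page callable are hypothetical names, and it assumes the endpoint returns a JSON list of visits, as in the export example above.

```python
from typing import Callable, Dict, Iterator, List


def build_payload(site_id: str, date_range: str, token_auth: str,
                  limit: int, offset: int) -> Dict[str, str]:
    """Build the POST body for one page of Live.getLastVisitsDetails."""
    return {
        "idSite": site_id,
        "period": "range",
        "date": date_range,
        "format": "JSON",
        "token_auth": token_auth,
        "filter_limit": str(limit),    # rows per page
        "filter_offset": str(offset),  # rows to skip before this page
    }


def iter_visits(fetch_page: Callable[[int, int], List[dict]],
                page_size: int = 10000) -> Iterator[dict]:
    """Yield visits page by page until the export is exhausted.

    fetch_page(offset, limit) is a hypothetical callable that performs
    the HTTP request and returns the decoded list of visits.
    """
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            break  # nothing left to export
        yield from page
        if len(page) < page_size:
            break  # a short page means this was the last one
        offset += page_size
```

In practice, fetch_page would wrap the requests.post call from the export example, passing the result of build_payload as data and returning response.json().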
2. Converting Matomo event data to the PostHog schema
The schema of Matomo's exported event data is similar to PostHog's schema, but it requires conversion to work with the rest of PostHog's data. You can see details on Matomo's schema in their docs and events and properties PostHog autocaptures in our docs.
The big difference is that Matomo structures their data around visits that contain one or more actions. Matomo actions are similar to PostHog's events. You can go through each visit and convert it to PostHog's schema by doing the following:
- Converting properties like operatingSystemVersion to $os_version.
- Omitting properties that aren't relevant like visitEcommerceStatusIcon, plugins, timeSpentPretty, and many more. Matomo includes many more properties on their events than PostHog does.
- Looping through the actionDetails, converting:
  - Properties like url to $current_url
  - Action types like action to event names like $pageview
  - Action timestamp to an ISO 8601 timestamp
- Adding visit properties to action properties
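
The steps above can be sketched for a single action as follows. This is a minimal illustration rather than the full migration script: convert_action and the trimmed-down KEY_MAPPING and OMITTED_KEYS are hypothetical, and it assumes the action's timestamp is a Unix timestamp in seconds.

```python
from datetime import datetime, timezone
from typing import Optional, Tuple

# Trimmed-down illustrative mappings; a real migration needs many more keys
KEY_MAPPING = {"url": "$current_url", "operatingSystemVersion": "$os_version"}
OMITTED_KEYS = {"timestamp", "timeSpentPretty", "plugins"}


def convert_action(visit_properties: dict,
                   action: dict) -> Optional[Tuple[str, dict, str]]:
    """Convert one Matomo action into (event name, properties, ISO timestamp).

    Returns None for goal actions, which this sketch skips.
    """
    action_type = action.get("type")
    if action_type == "goal":
        return None
    if action_type == "action":
        event_name = "$pageview"
    elif action_type == "event":
        event_name = action.get("eventAction")
    else:
        event_name = action_type

    properties = {}
    for key, value in action.items():
        # Drop empty values and properties with no PostHog equivalent
        if value in ("", None) or key in OMITTED_KEYS:
            continue
        # Rename mapped keys, keep everything else under its Matomo name
        properties[KEY_MAPPING.get(key, key)] = value
    # Visit-level properties apply to every action in the visit
    properties.update(visit_properties)

    # Timezone-aware conversion, so the result is unambiguous UTC
    iso_timestamp = datetime.fromtimestamp(
        action["timestamp"], tz=timezone.utc
    ).isoformat()
    return event_name, properties, iso_timestamp
```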
Once this is done, you can capture each action into PostHog using the Python SDK or the capture API endpoint with historical_migration set to true.
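
If you use the API rather than the SDK, the request body might look like the sketch below. It assumes PostHog's batch capture format (a top-level api_key, a historical_migration flag, and a batch list whose items carry distinct_id inside properties); check the PostHog API docs for the current shape and endpoint before relying on it. build_batch_payload is a hypothetical helper, and nothing here sends a request.

```python
import json
from typing import Iterable, Tuple


def build_batch_payload(api_key: str,
                        events: Iterable[Tuple[str, str, dict, str]]) -> dict:
    """Assemble a request body from (distinct_id, event, properties, timestamp) tuples.

    The field names follow the assumed batch capture format described above.
    """
    batch = []
    for distinct_id, event_name, properties, iso_timestamp in events:
        batch.append({
            "event": event_name,
            # distinct_id travels alongside the other event properties
            "properties": {**properties, "distinct_id": distinct_id},
            "timestamp": iso_timestamp,
        })
    return {
        "api_key": api_key,
        "historical_migration": True,
        "batch": batch,
    }


payload = build_batch_payload(
    "<ph_project_api_key>",
    [("visitor-123", "$pageview",
      {"$current_url": "https://example.com/"},
      "2024-01-01T00:00:00+00:00")],
)
body = json.dumps(payload)  # JSON string to send as the request body
```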
Here's an example version of a Python script:

    from posthog import Posthog
    from datetime import datetime, timezone
    import json

    posthog = Posthog(
        '<ph_project_api_key>',
        host='https://us.i.posthog.com',
        debug=True,
        historical_migration=True
    )

    # Matomo property names mapped to their PostHog equivalents
    key_mapping = {
        'url': '$current_url',
        'browserName': '$browser',
        'browserVersion': '$browser_version',
        'operatingSystemName': '$os',
        'operatingSystemVersion': '$os_version',
        'deviceType': '$device_type',
        'referrerUrl': '$referrer',
        'city': '$geoip_city_name',
        'region': '$geoip_subdivision_1_name',
        'regionCode': '$geoip_subdivision_1_code',
        'country': '$geoip_country_name',
        'countryCode': '$geoip_country_code',
        'continent': '$geoip_continent_name',
        'continentCode': '$geoip_continent_code',
        'latitude': '$geoip_latitude',
        'longitude': '$geoip_longitude',
        'visitIp': '$ip',
        'siteName': '$host',
        'languageCode': '$browser_language'
    }

    # Matomo properties dropped during conversion
    omitted_keys = [
        'accountId', 'iconSVG', 'idpageview', 'pageIdAction',
        'pageviewPosition', 'pageTitle', 'serverTimePretty', 'subtitle',
        'timeSpentPretty', 'timestamp', 'actions', 'browser', 'browserCode',
        'browserFamily', 'browserFamilyDescription', 'browserIcon',
        'countryFlag', 'deviceTypeIcon', 'events', 'experiments',
        'fingerprint', 'idSite', 'idVisit', 'language', 'lastActionDateTime',
        'location', 'operatingSystem', 'operatingSystemCode',
        'operatingSystemIcon', 'plugins', 'pluginsIcons',
        'referrerSearchEngineIcon', 'referrerSocialNetworkIcon', 'serverDate',
        'serverDatePretty', 'serverDatePrettyFirstAction',
        'serverTimePrettyFirstAction', 'serverTimestamp', 'sessionReplayUrl',
        'siteCurrency', 'siteCurrencySymbol', 'siteName',
        'totalAbandonedCarts', 'totalAbandonedCartsItems',
        'totalAbandonedCartsRevenue', 'totalEcommerceConversions',
        'totalEcommerceItems', 'totalEcommerceRevenue', 'userId',
        'visitConverted', 'visitConvertedIcon', 'visitCount', 'visitDuration',
        'visitDurationPretty', 'visitEcommerceStatus',
        'visitEcommerceStatusIcon', 'visitLocalHour', 'visitLocalTime',
        'visitServerHour', 'lastActionTimestamp', 'firstActionTimestamp',
        'visitorId', 'visitorTypeIcon', 'icon'
    ]

    with open('events.json', 'r') as file:
        events = json.load(file)

    for event in events:
        distinct_id = event.get("userId") or event.get("visitorId")

        # Build the visit-level properties shared by every action in the visit
        visitProperties = {}
        for key, value in event.items():
            if value == '' or value is None:
                continue
            elif key in omitted_keys:
                continue
            elif key in key_mapping:
                if key in ['browserVersion', 'longitude', 'latitude']:
                    visitProperties[key_mapping[key]] = float(value)
                else:
                    visitProperties[key_mapping[key]] = value
            elif key == 'actionDetails':
                # Handle actionDetails separately
                continue
            elif key == 'resolution':
                # Split a value like "1920x1080" into width and height
                visitProperties['$screen_width'] = int(value.split('x')[0])
                visitProperties['$screen_height'] = int(value.split('x')[1])
            elif key == 'referrerKeyword':
                if value != 'Keyword not defined':
                    visitProperties['referrerKeyword'] = value
            else:
                visitProperties[key] = value

        # Convert each action in the visit into a PostHog event
        actionDetails = event.get('actionDetails', [])
        for action in actionDetails:
            actionProperties = {}
            for key, value in action.items():
                if value == '' or value is None:
                    continue
                if key in omitted_keys:
                    continue
                if key in key_mapping:
                    actionProperties[key_mapping[key]] = value
                else:
                    actionProperties[key] = value

            actionProperties.update(visitProperties)

            ph_event_name = action.get('type')
            if ph_event_name == 'action':
                ph_event_name = "$pageview"
            if ph_event_name == 'event':
                ph_event_name = action.get('eventAction')
            if ph_event_name == 'goal':
                # Goal will be a duplicate of another action, so skip it
                continue

            ph_timestamp = action.get("timestamp")
            if ph_timestamp:
                # Timezone-aware UTC datetime, so it isn't read as local time
                ph_timestamp = datetime.fromtimestamp(ph_timestamp, tz=timezone.utc)

            posthog.capture(
                distinct_id=distinct_id,
                event=ph_event_name,
                properties=actionProperties,
                timestamp=ph_timestamp
            )