UKWA Dataflows
How the UK Web Archive worked (c.2023)
This page is nowhere near complete, and may never be so!
This page documents my understanding of the UK Web Archive dataflow as of mid-2023.
Introduction
blah. Data Lake.
Dataflow
dataflow 1.0
title: "UKWA Crawler Dataflow"
zoom 0.9
height 300
offset 10 0
# Locations where data can be stored:
place internet "Internet"
place w3act "W3ACT"
place pywb "PyWB"
place cdx "CDX Index"
place net "NET"
place crawler "Crawler"
place hadoop "Archival\nStorage"
# Data types and descriptions:
data website "Website" black
data warcs "WARCs" red
data md "Metadata" blue
data w3act "W3ACT" darkblue
data pywb "PyWB" purple
data cdx "CDX" orange
data query "Query" black
data playback "Playback" green
# Events
start website@internet
start w3act@w3act
start pywb@pywb
derive w3act@w3act md@w3act "Export\nDatabase" [0,-1]
move md@w3act md@hadoop "Copy to HDFS"
copy md@hadoop md@crawler "Update\nCrawl Targets"@E
space
copy website@internet website@crawler "Crawl"
space
transform website@crawler warcs@crawler "Package\nWARCs"@N
copy warcs@crawler warcs@hadoop "Copy to\nHDFS"
delete warcs@crawler "Delete\nWARCs"@N
space
derive warcs@hadoop cdx@hadoop "Generate CDX"@N [0,1]
move cdx@hadoop cdx@cdx "Update\nCDX Server"@E
copy cdx@cdx cdx@pywb "Query CDX"
copy warcs@hadoop warcs@pywb "Get WARC"
derive warcs@pywb playback@pywb "Rewrite\nResource"@N [0,1]
move playback@pywb playback@internet "Deliver"@E@0.7
delete warcs@pywb,cdx@pywb " "@S
# And we're done:
end
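To check that the events above leave data where expected, the diagram can be replayed as a small state trace. This is only a rough sketch in Python, not part of the diagram tool: the `trace` function and the tuple encoding are hypothetical, and the verb semantics are my reading of the diagram (copy keeps the source, move removes it, derive adds new data alongside its source, transform replaces its source), not documented behaviour of the DSL.

```python
from collections import defaultdict

def trace(events):
    """Replay events in order and return {place: set of data held there}.

    Verb semantics are assumptions inferred from the diagram, not from
    any documentation of the dataflow DSL.
    """
    held = defaultdict(set)

    def need(data, place):
        # Every event other than "start" must act on data that is present.
        if data not in held[place]:
            raise ValueError(f"{data} is not at {place}")

    for verb, *args in events:
        if verb == "start":                 # data exists at a place from the outset
            data, place = args
            held[place].add(data)
        elif verb in ("copy", "move"):      # (data, source place, destination place)
            data, src, dst = args
            need(data, src)
            if verb == "move":              # move removes the source; copy keeps it
                held[src].discard(data)
            held[dst].add(data)
        elif verb in ("derive", "transform"):  # (source data, new data, place)
            src_data, new_data, place = args
            need(src_data, place)
            if verb == "transform":         # transform replaces the source data
                held[place].discard(src_data)
            held[place].add(new_data)
        elif verb == "delete":              # (data, place)
            data, place = args
            need(data, place)
            held[place].discard(data)
        else:
            raise ValueError(f"unknown verb: {verb}")
    return dict(held)

# The events of the crawler diagram, transcribed by hand.
CRAWLER_FLOW = [
    ("start", "website", "internet"),
    ("start", "w3act", "w3act"),
    ("start", "pywb", "pywb"),
    ("derive", "w3act", "md", "w3act"),          # Export Database
    ("move", "md", "w3act", "hadoop"),           # Copy to HDFS
    ("copy", "md", "hadoop", "crawler"),         # Update Crawl Targets
    ("copy", "website", "internet", "crawler"),  # Crawl
    ("transform", "website", "warcs", "crawler"),  # Package WARCs
    ("copy", "warcs", "crawler", "hadoop"),      # Copy to HDFS
    ("delete", "warcs", "crawler"),              # Delete WARCs
    ("derive", "warcs", "cdx", "hadoop"),        # Generate CDX
    ("move", "cdx", "hadoop", "cdx"),            # Update CDX Server
    ("copy", "cdx", "cdx", "pywb"),              # Query CDX
    ("copy", "warcs", "hadoop", "pywb"),         # Get WARC
    ("derive", "warcs", "playback", "pywb"),     # Rewrite Resource
    ("move", "playback", "pywb", "internet"),    # Deliver
    ("delete", "warcs", "pywb"),
    ("delete", "cdx", "pywb"),
]

final = trace(CRAWLER_FLOW)
print(sorted(final["hadoop"]))   # ['md', 'warcs']
```

Under these assumed semantics, the only long-lived copy of the WARCs ends up in archival storage, while the crawler and PyWB hold them only temporarily.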
dataflow 1.0
title: "UKWA Playback Dataflow"
zoom 0.9
height 300
offset 10 0
# Locations where data can be stored:
place internet "Internet"
place w3act "W3ACT"
place pywb "PyWB"
place cdx "CDX Index"
place crawler "Crawler"
place hadoop "Archival\nStorage"
# Domains where locations are maintained:
domain public "Public Network"
domain n45 "Service Network"
domain n1 "Storage Network"
# Data types and descriptions:
data website "Website" black
data pywb "PyWB" purple
data warcs "WARCs" red
data md "Metadata" blue
data w3act "W3ACT" darkblue
data cdx "CDX" orange
data query "Query" black
data response "Response" black
data playback "Playback" green
start query@internet
start pywb@pywb
start warcs@hadoop,md@hadoop
space
derive warcs@hadoop cdx@hadoop "Generate CDX"@N [0,1]
move cdx@hadoop cdx@cdx "Update\nCDX Server"@E
move query@internet query@pywb "Request URL"
copy cdx@cdx cdx@pywb "Query CDX"
copy warcs@hadoop warcs@pywb "Get WARC"
space
derive warcs@pywb playback@pywb "Rewrite\nResource"@S [0,-1]
move playback@pywb playback@internet "Deliver"@E
delete warcs@pywb,cdx@pywb " "
# And we're done:
end
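The playback steps above (Generate CDX, Request URL, Query CDX, Get WARC, Rewrite Resource, Deliver) can be illustrated with a toy in-memory version. This is a loose sketch only: every name here (`WARCS`, `ARCHIVE`, `generate_cdx`, `playback`) is hypothetical, the replay prefix is a placeholder, and real CDX entries locate records by byte offset within a WARC file rather than by list index. It is not how PyWB or the UKWA services are implemented.

```python
# Hypothetical stand-ins for illustration; nothing here is UKWA or PyWB code.
ARCHIVE = "https://archive.example/"   # placeholder replay prefix

# "Archival Storage": WARC file name -> list of captured records.
WARCS = {
    "crawl-0001.warc.gz": [
        {
            "url": "http://example.org/",
            "timestamp": "20230601120000",
            "body": '<a href="http://example.org/about">About</a>',
        },
    ],
}

def generate_cdx(warcs):
    """Generate CDX: index every record by URL.

    Each entry is (timestamp, WARC file name, record index). A real CDX
    line stores a byte offset instead of a list index.
    """
    cdx = {}
    for filename, records in warcs.items():
        for index, record in enumerate(records):
            cdx.setdefault(record["url"], []).append(
                (record["timestamp"], filename, index)
            )
    for captures in cdx.values():
        captures.sort()                # oldest capture first
    return cdx

def playback(url, cdx, warcs):
    """Request URL -> Query CDX -> Get WARC -> Rewrite Resource -> Deliver."""
    captures = cdx.get(url)            # Query CDX
    if not captures:
        return None                    # nothing archived for this URL
    timestamp, filename, index = captures[-1]   # pick the latest capture
    record = warcs[filename][index]    # Get WARC record
    # Rewrite Resource: point absolute links back into the archive so that
    # following them stays within archived content.
    return record["body"].replace(
        'href="http', f'href="{ARCHIVE}{timestamp}/http'
    )

cdx = generate_cdx(WARCS)
page = playback("http://example.org/", cdx, WARCS)
print(page)
```

The point of the sketch is the division of labour: the index answers "where is the capture?", storage supplies the record, and only the playback service holds the rewritten copy that is delivered.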
Example block diagram