UKWA Dataflows

How the UK Web Archive worked (c.2023)

This page is nowhere near complete, and may never be so!

This page documents my understanding of the UK Web Archive dataflow as of mid-2023.

Introduction

blah. Data Lake.

Dataflow

dataflow 1.0
title: "UKWA Crawler Dataflow"
zoom 0.9
height 300
offset 10 0

# Locations where data can be stored:
place internet "Internet"
place w3act "W3ACT"
place pywb "PyWB"
place cdx "CDX Index"
place net "NET"
place crawler "Crawler"
place hadoop "Archival\nStorage"

# Data types and descriptions:
data website "Website" black
data warcs "WARCs" red
data md "Metadata" blue
data w3act "W3ACT" darkblue
data pywb "PyWB" purple
data cdx "CDX" orange
data query "Query" black
data playback "Playback" green


# Events
start website@internet
start w3act@w3act
start pywb@pywb

derive w3act@w3act md@w3act "Export\nDatabase" [0,-1]
move md@w3act md@hadoop "Copy to HDFS"
copy md@hadoop md@crawler "Update\nCrawl Targets"@E
space


copy website@internet website@crawler "Crawl"
space
transform website@crawler warcs@crawler "Package\nWARCs"@N
copy warcs@crawler warcs@hadoop "Copy to\nHDFS"
delete warcs@crawler "Delete\nWARCs"@N

space
derive warcs@hadoop cdx@hadoop "Generate CDX"@N [0,1]
move cdx@hadoop cdx@cdx "Update\nCDX Server"@E 

copy cdx@cdx cdx@pywb "Query CDX" 
copy warcs@hadoop warcs@pywb "Get WARC"
derive warcs@pywb playback@pywb "Rewrite\nResource"@N [0,1]
move playback@pywb playback@internet "Deliver"@E@0.7
delete warcs@pywb,cdx@pywb " "@S

# And we're done:
end
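
The crawler events above can be read as a small state machine over "data@place" pairs. The following Python sketch replays them and reports where each kind of data ends up. It is only an illustration: the verb semantics (copy and derive keep the source, move and transform consume it) are my assumption about how the diagram should be read, and the `run` and `parse` helpers are hypothetical names, not part of any UKWA tool.

```python
from typing import Iterable, Set, Tuple

Item = Tuple[str, str]  # (data type, place), e.g. ("warcs", "hadoop")

# Assumed semantics of the diagram verbs (not taken from any specification):
KEEP_SOURCE = {"copy", "derive"}     # the source stays where it is
DROP_SOURCE = {"move", "transform"}  # the source is consumed


def parse(ref: str) -> Item:
    """Split a 'data@place' reference into a (data, place) pair."""
    data, place = ref.split("@")
    return data, place


def run(events: Iterable[str]) -> Set[Item]:
    """Replay events in order and return the final set of (data, place) pairs."""
    state: Set[Item] = set()
    for line in events:
        verb, *refs = line.split()
        source = parse(refs[0])
        if verb == "start":
            state.add(source)
            continue
        if source not in state:
            raise ValueError(f"{refs[0]} is not present before: {line}")
        if verb == "delete":
            state.discard(source)
        elif verb in KEEP_SOURCE | DROP_SOURCE:
            state.add(parse(refs[1]))
            if verb in DROP_SOURCE:
                state.discard(source)
        else:
            raise ValueError(f"unknown verb: {verb}")
    return state


# The crawl-side events from the diagram, with labels and layout hints removed.
CRAWLER_EVENTS = [
    "start website@internet",
    "start w3act@w3act",
    "derive w3act@w3act md@w3act",
    "move md@w3act md@hadoop",
    "copy md@hadoop md@crawler",
    "copy website@internet website@crawler",
    "transform website@crawler warcs@crawler",
    "copy warcs@crawler warcs@hadoop",
    "delete warcs@crawler",
    "derive warcs@hadoop cdx@hadoop",
    "move cdx@hadoop cdx@cdx",
]

if __name__ == "__main__":
    for data, place in sorted(run(CRAWLER_EVENTS)):
        print(f"{data}@{place}")
```

Under these assumptions the WARCs end up only in archival storage, the crawler keeps just its copy of the crawl-target metadata, and the CDX records live only in the CDX index.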

dataflow 1.0
title: "UKWA Playback Dataflow"
zoom 0.9
height 300
offset 10 0

# Locations where data can be stored:
place internet "Internet"
place w3act "W3ACT"
place pywb "PyWB"
place cdx "CDX Index"
place crawler "Crawler"
place hadoop "Archival\nStorage"

# Domains where locations are maintained:
domain public "Public Network"
domain n45 "Service Network"
domain n1 "Storage Network"

# Data types and descriptions:
data website "Website" black
data pywb "PyWB" purple
data warcs "WARCs" red
data md "Metadata" blue
data w3act "W3ACT" darkblue
data cdx "CDX" orange
data query "Query" black
data response "Response" black
data playback "Playback" green

start query@internet
start pywb@pywb
start warcs@hadoop,md@hadoop

space
derive warcs@hadoop cdx@hadoop "Generate CDX"@N [0,1]
move cdx@hadoop cdx@cdx "Update\nCDX Server"@E 

move query@internet query@pywb "Request URL"
copy cdx@cdx cdx@pywb "Query CDX" 
copy warcs@hadoop warcs@pywb "Get WARC"
space
derive warcs@pywb playback@pywb "Rewrite\nResource"@S [0,-1]
move playback@pywb playback@internet "Deliver"@E
delete warcs@pywb,cdx@pywb " "

# And we're done:
end
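
The playback steps above (query the CDX index, fetch the WARC data, rewrite the resource, deliver it) can be sketched in a few lines of Python. This is a toy model under loud assumptions, not how PyWB is implemented: the index and the storage are plain in-memory dictionaries, the "record" is just the HTML payload with no WARC headers or compression, and `CdxEntry`, `REPLAY_PREFIX`, `rewrite` and `play_back` are hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional

REPLAY_PREFIX = "/replay/"  # hypothetical prefix, not a real UKWA or PyWB path


@dataclass(frozen=True)
class CdxEntry:
    """Where a captured resource lives: which WARC file, and which byte range."""
    warc_name: str
    offset: int
    length: int


def get_record(storage: Dict[str, bytes], entry: CdxEntry) -> bytes:
    """'Get WARC': read only the byte range that the index entry points at."""
    data = storage[entry.warc_name]
    return data[entry.offset:entry.offset + entry.length]


def rewrite(body: str) -> str:
    """'Rewrite Resource': point absolute links back at the replay service."""
    return body.replace('href="http', f'href="{REPLAY_PREFIX}http')


def play_back(url: str, cdx: Dict[str, CdxEntry],
              storage: Dict[str, bytes]) -> Optional[str]:
    """'Request URL' to 'Deliver': return rewritten content, or None if not archived."""
    entry = cdx.get(url)  # 'Query CDX'
    if entry is None:
        return None
    return rewrite(get_record(storage, entry).decode("utf-8"))


if __name__ == "__main__":
    page = b'<a href="http://example.org/b">b</a>'
    storage = {"example.warc": b"XXXX" + page + b"YYYY"}
    cdx = {"http://example.org/a": CdxEntry("example.warc", 4, len(page))}
    print(play_back("http://example.org/a", cdx, storage))
    # prints: <a href="/replay/http://example.org/b">b</a>
```

The point of the sketch is the division of labour shown in the diagram: the index only says where a capture is stored, the storage layer only serves bytes, and the rewriting happens in the playback service just before delivery.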

Example block diagram