R⁶ — Capturing [YouTube] Captions

(R⁶ == brief, low-expository posts)

@yoniceedee suggested I look at the Cambridge Analytics “whistleblower” testimony proceedings:

I value the resources @yoniceedee tosses my way (they often end me down twisted paths like this one, though 🙂 but I really dislike spending any amount of time on youtube and can consume text context much faster than even accelerated video playback.

Google auto-generated captions for that video and you can display them by clicking below the video on the right and enabling the transcript which slowly (well, in my frame of reference) loads into the upper-right. That’s still sub-optimal since we need to be on the youtube page to read/scroll. There’s no “export” option and my initial instinct was to go to Developer Tools and look for the https://www.youtube.com/service_ajax?name=getTranscriptEndpoint URL and “Copy the Response” to the clipboard and save it to a file then do some JSON/list wrangling (the transcript JSON URL is in the snippet below):

library(tidyverse)

trscrpt <- jsonlite::fromJSON("https://rud.is/dl/ca-transcript.json")

runs <- trscrpt$data$actions$openTranscriptAction$transcriptRenderer$transcriptRenderer$body$transcriptBodyRenderer$cueGroups[[1]]$transcriptCueGroupRenderer$formattedStartOffset$runs
cues <- trscrpt$data$actions$openTranscriptAction$transcriptRenderer$transcriptRenderer$body$transcriptBodyRenderer$cueGroups[[1]]$transcriptCueGroupRenderer$cues

data_frame(
  mark = map_chr(runs, ~.x$text),
  text = map_chr(cues, ~.x$transcriptCueRenderer$cue$runs[[1]]$text)  
) %>% 
  separate(mark, c("minute", "second"), sep=":", remove = FALSE, convert = TRUE) 
## # A tibble: 3,247 x 4
##    mark  minute second text                                    
##    <chr>  <int>  <int> <chr>                                   
##  1 00:00      0      0 all sort of yeah web of things if it's a
##  2 00:02      0      2 franchise then there's a kind of        
##  3 00:03      0      3 ultimately there's a there's a there's a
##  4 00:05      0      5 coordinator of that franchise or someone
##  5 00:07      0      7 who's a you got a that franchise is well
##  6 00:09      0      9 well when I was there that was Alexander
##  7 00:13      0     13 Nixon Steve banning but that's that's a 
##  8 00:16      0     16 question you should be asking aiq yeah  
##  9 00:18      0     18 yeah and just got to a IQ and the GSR   
## 10 00:24      0     24 state from gts-r that's other Hogan data
## # ... with 3,237 more rows

But, then I remembered YouTube has an API for this and threw together a quick script to grab them that way as well:

# the API needs these scopes

c(
  "https://www.googleapis.com/auth/youtube.force-ssl",
  "https://www.googleapis.com/auth/youtubepartner"
) -> scope_list

# oauth dance

httr::oauth_app(
  appname = "google",
  key = Sys.getenv("GOOGLE_APP_SECRET"),
  secret = Sys.getenv("GOOGLE_APP_KEY")
) -> captions_app

httr::oauth2.0_token(
  endpoint = httr::oauth_endpoints("google"),
  app = captions_app,
  scope = scope_list,
  cache = TRUE
) -> google_token

# list the available captions for this video
# (captions can be in one or more languages)

httr::GET(
  url = "https://www.googleapis.com/youtube/v3/captions",
  query = list(
    part = "snippet",
    videoId = "f2Sxob3fl0k" # the v=string in the YouTube URL
  ),
  httr::config(token = google_token)
) -> caps_list

# I'm cheating since I know there's only one but you'd want
# to introspect `caps_list` before blindly doing this for 
# other videos.

httr::GET(
  url = sprintf(
    "https://www.googleapis.com/youtube/v3/captions/%s",
    httr::content(caps_list)$items[[1]]$id
  ),
  httr::config(token = google_token)
) -> caps

# strangely enough, the JSON response "feels" better than this
# one, though this is a standard format that's parseable quite well.

cat(rawToChar(httr::content(caps)))
## 0:00:00.000,0:00:03.659
## all sort of yeah web of things if it's a
## 
## 0:00:02.490,0:00:05.819
## franchise then there's a kind of
## 
## 0:00:03.659,0:00:07.589
## ultimately there's a there's a there's a
## 
## 0:00:05.819,0:00:09.660
## coordinator of that franchise or someone
## 
## 0:00:07.589,0:00:13.139
## who's a you got a that franchise is well
## 
## 0:00:09.660,0:00:16.230
## well when I was there that was Alexander
## ...

Neither a reflection on active memory nor a quick Duck Duck Go search (I try not to use Google Search anymore) seemed to point to an existing R resource for this, hence the quick post in the event the snippet is helpful to anyone else.

If you do know of an R package/snippet that does this already, please shoot a note into the comments so others can find it.



This is a Security Bloggers Network syndicated blog post authored by hrbrmstr. Read the original post at: rud.is