# Transcriptator

[![GitHub forks](https://img.shields.io/github/forks/stevencrader/transcriptator.svg?style=social&label=Fork&maxAge=2592000)](https://github.com/stevencrader/transcriptator/network/) [![GitHub stars](https://img.shields.io/github/stars/stevencrader/transcriptator.svg?style=social&label=Star&maxAge=2592000)](https://github.com/stevencrader/transcriptator/stargazers/)
[![npm](https://img.shields.io/npm/v/transcriptator)](https://www.npmjs.com/package/transcriptator) [![npm](https://img.shields.io/npm/v/transcriptator?label=yarn)](https://yarnpkg.com/package?name=transcriptator) [![install size](https://packagephobia.com/badge?p=transcriptator)](https://packagephobia.com/result?p=transcriptator) ![License](https://img.shields.io/badge/License-MIT-blue.svg) [![Number of Contributors](https://img.shields.io/github/contributors/stevencrader/transcriptator?style=flat&label=Contributors)](https://github.com/stevencrader/transcriptator/graphs/contributors)
[![Issues opened](https://img.shields.io/github/issues/stevencrader/transcriptator?label=Issues)](https://github.com/stevencrader/transcriptator) [![PRs open](https://img.shields.io/github/issues-pr/stevencrader/transcriptator?label=Pull%20Requests)](https://github.com/stevencrader/transcriptator/pulls) [![PRs closed](https://img.shields.io/github/issues-pr-closed/stevencrader/transcriptator?label=Pull%20Requests)](https://github.com/stevencrader/transcriptator/pulls?q=is%3Apr+is%3Aclosed) [![codecov](https://codecov.io/gh/stevencrader/transcriptator/branch/master/graph/badge.svg?token=KZMGXY8LIH)](https://codecov.io/gh/stevencrader/transcriptator)

Library for converting the various transcript file formats to a common format. Originally designed to help users of the [Podcast Namespace](https://github.com/Podcastindex-org/podcast-namespace/blob/main/docs/1.0.md#transcript) `podcast:transcript` tag. ## Installation This is a Node.js module available through npm or yarn. ### Using npm: ```bash npm install transcriptator ``` ### Using yarn: ```bash yarn add transcriptator ``` ### Using CDN: [transcriptator jsDelivr CDN](https://www.jsdelivr.com/package/npm/transcriptator) ## Usage There are three primary methods and two types. See the jsdoc for additional information. The `convertFile` function accepts the transcript file data and parses it in to an array of `Segment`. If `transcriptFormat` is not defined, will use `determineFormat` to attempt to identify the type. convertFile(data: string, transcriptFormat: TranscriptFormat = undefined): Array The `determineFormat` function accepts the transcript file data and attempts to identify the `TranscriptFormat`. determineFormat(data: string): TranscriptFormat The `TranscriptFormat` enum defines the allowable transcript types supported by Transcriptator. The `Segment` type defines the segment/cue of the transcript. ### Custom timestamp formatter To change the way the `startTime` and `endTime` are formatted in `startTimeFormatted` and `endTimeFormatted`, register a custom formatter to be used instead. The formatter function shall accept a single argument as a number and return the value formatted as a string. ```javascript import { TimestampFormatter } from "transcriptator" function customFormatter(timestamp) { return timestamp.toString() } TimestampFormatter.registerCustomFormatter(customFormatter) ``` ### Options for segments Additional options are available for combining or formatting two or more segments To change the options, use the `Options.setOptions` function. The options only need to be specified once and will be used when parsing any transcript data. To restore options to their default value, call `Options.restoreDefaultSettings`. The `IOptions` interface used by `Options` defines options for combining and formatting parsed segments. - `combineEqualTimes`: boolean - Combine segments if the `Segment.startTime`, `Segment.endTime`, and `Segment.speaker` match between the current and prior segments - Can be used with `combineSegments`. The `combineEqualTimes` rule is applied first. - Can be used with `speakerChange`. The `speakerChange` rule is applied last. - Cannot be used with `combineSpeaker` - Default: false - `combineEqualTimesSeparator`: string - Character to use when `combineEqualTimes` is true. - Default: `\n` - `combineSegments`: boolean - Combine segments where speaker is the same and concatenated `body` fits in the `combineSegmentsLength` - Can be used with `combineEqualTimes`. The `combineSegments` rule is applied first. - Can be used with `speakerChange`. The `speakerChange` rule is applied last. - Cannot be used with `combineSpeaker` - Default: false - `combineSegmentsLength`: number - Max length of body text to use when `combineSegments` is true - Default: See `DEFAULT_COMBINE_SEGMENTS_LENGTH` - `combineSpeaker`: boolean - Combine consecutive segments from the same speaker. - Note: If this is enabled, `combineEqualTimes` and `combineSegments` will not be applied. - Warning: if the transcript does not contain speaker information, resulting segment will contain entire transcript text. - Default: false - `speakerChange`: boolean - Only include `Segment.speaker` when speaker changes - May be used in combination with `combineSpeaker`, `combineEqualTimes`, or `combineSegments` - Default: false ```javascript import { Options } from "transcriptator" Options.setOptions({ combineSegments: true, combineSegmentsLength: 32, }) ``` ## Supported File Formats ### SRT Transcripts which follow the SRT/SubRip format ```text 1 00:00:00,780 --> 00:00:06,210 Adam Curry: podcasting 2.0 March 4 2023 Episode 124 on D flat 2 00:00:06,210 --> 00:00:12,990 formable hello everybody welcome to a delayed board meeting of ``` The timestamp may contain the hour and minutes but is not required. The millisecond may be separated with either a comma or decimal. Attempts to find the speaker's name from the beginning of the first line of each segment. References: - https://en.wikipedia.org/wiki/SubRip ### HTML HTML data in format below are considered to be transcripts. The elements `cite`, `time`, and `p` are used to define a segment. The `cite` element is not required. The order is also not required. The elements may either be a child of the document directly or a direct child of the `html` or `body` element. Elements do not need to be on separate lines. **Example 1** ```html Alban: 0:00

It is so stinking nice to like, show up and record this show. And Travis has already put together an outline. Kevin's got suggestions, I throw my thoughts into the mix. And then Travis goes and does all the work from there, too. It's out into the wild. And I don't see anything. That's an absolute joy for at least two thirds of the team. Yeah, I mean, exactly.

Kevin: 0:30

You guys remember, like two months ago, when you were like, We're going all in on video Buzzcast. I was like, that's, I mean, I will agree and commit and disagree, disagree and commit, I'll do something. But I don't want to do this.

``` **Example 2** ```html

It is so stinking nice to like, show up and record this show. And Travis has already put together an outline. Kevin's got suggestions, I throw my thoughts into the mix. And then Travis goes and does all the work from there, too. It's out into the wild. And I don't see anything. That's an absolute joy for at least two thirds of the team. Yeah, I mean, exactly.

0:00

You guys remember, like two months ago, when you were like, We're going all in on video Buzzcast. I was like, that's, I mean, I will agree and commit and disagree, disagree and commit, I'll do something. But I don't want to do this.

0:30 ``` ### JSON JSON data in one of the formats below are considered to be transcripts. In both formats, the data does not need to be in pretty print format. **Format 1** ```json { "version": "1.0.0", "segments": [ { "speaker": "Alban", "startTime": 0.0, "endTime": 4.8, "body": "It is so stinking nice to" }, { "speaker": "Alban", "startTime": 0.0, "endTime": 4.8, "body": "like, show up and record this" } ] } ``` There must be a `segments` list of objects containing `speaker`, `startTime`, `endTime`, and `body`. The `startTime` and `endTime` are assumed to be in seconds. **Format 2** ```json [ { "start": 1, "end": 5000, "text": "Subtitles: @marlonrock1986 (^^V^^)" }, { "start": 25801, "end": 28700, "text": "It's another hot, sunny day today\nhere in Southern California." } ] ``` The top level element must be a list of objects containing `start`, `end`, and `text`. The `start` and `end` are assumed to be in milliseconds. Attempts to find the speaker's name from the beginning of the `text` value. ### WebVTT Transcripts which follow the WebVTT/VTT format ``` WEBVTT 1 00:00:00.001 --> 00:00:05.000 Subtitles: @marlonrock1986 (^^V^^) 2 00:00:25.801 --> 00:00:28.700 It's another hot, sunny day today here in Southern California. ``` The index number is optional: ``` WEBVTT 00:00:00.000 --> 00:00:11.840 Buenas, bienvenidas de vuelta a KDE Express. Esta vez para no perder el ritmo volvemos a la 00:00:11.840 --> 00:00:16.800 versión movilidad que no tenemos a los compañeros disponibles y hoy quería haceros un especial ``` The timestamp may contain the hour and minutes but is not required. The millisecond may be separated with either a comma or decimal. Attempts to find the speaker's name from the beginning of the first line of each segment. References: - https://www.w3.org/TR/webvtt1/ - https://en.wikipedia.org/wiki/WebVTT ## Test Transcripts Transcripts used for testing are excerpts from the following shows. - [Podcasting 2.0](https://podcastindex.org/podcast/920666) - podcasting_20_episode_124.srt (from Episode 124) - [Buzzcast](https://buzzcast.buzzsprout.com/231452/9092843) - buzzcast.html - buzzcast.srt - buzzcast.json - [How to Start a Podcast](https://feeds.buzzsprout.com/1/2562823/) - how_to_start_a_podcast.json - how_to_start_a_podcast.html - [Podnews Daily (2024-01-25)](https://podnews.net/update/nz-podcast-summit-2024) - podnews_daily_2024-01-25.vtt - [Podnews Weekly Review (2023-03-17)](https://feeds.buzzsprout.com/1538779/12458004/) - podnews_weekly_review_2023-03-17.html - [Podnews Weekly Review (2023-05-05)](https://feeds.buzzsprout.com/1538779/12782529/) - podnews_weekly_review_2023-05-05.json - [subtitle.js](https://github.com/gsantiago/subtitle.js) - LaLaLand.vtt - LaLaLand.json - [KDE Express](https://kdeexpress.gitlab.io/posts/kdeexpress/16-kde-express/) - kde_express-16_kde_en_telegram.vtt ## Contributing Please see the [Contribution Guide](Contributing.md)