Commit

Merge pull request #28 from N0taN3rd/chrome-remote-interface-extra
Added Chrome remote interface extra capturer and writer
N0taN3rd authored Feb 24, 2019
2 parents 77fc657 + 8a8ea65 commit 093b87c
Showing 18 changed files with 887 additions and 846 deletions.
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -1,4 +1,9 @@
# Changelog
## 3.3.0 (2019-02-24)
- **Feature**
- Added Request Capturer for [chrome-remote-interface-extra](https://github.com/N0taN3rd/chrome-remote-interface-extra)
- Added WARC writer for [chrome-remote-interface-extra](https://github.com/N0taN3rd/chrome-remote-interface-extra)

## 3.2.0 (2018-12-28)

- **Feature**
53 changes: 49 additions & 4 deletions README.md
@@ -1,5 +1,11 @@
# node-warc
Parse Web Archive (WARC) files or create WARC files using [Electron](https://electron.atom.io/), [chrome-remote-interface](https://github.com/cyrus-and/chrome-remote-interface), [Puppeteer](https://github.com/GoogleChrome/puppeteer), or [request](https://github.com/request/request)
Parse Web Archive (WARC) files or create WARC files using
- [chrome-remote-interface](https://github.com/cyrus-and/chrome-remote-interface)
- [chrome-remote-interface-extra](https://github.com/N0taN3rd/chrome-remote-interface-extra)
- [Puppeteer](https://github.com/GoogleChrome/puppeteer)
- [Electron](https://electron.atom.io/)
- [request](https://github.com/request/request)


Run `npm install node-warc` or `yarn add node-warc` to get started

@@ -91,7 +97,7 @@ parser.start()

### Examples

#### Using chrome-remote-interface
#### Using [chrome-remote-interface](https://github.com/cyrus-and/chrome-remote-interface)

```js
const CRI = require('chrome-remote-interface')
@@ -104,6 +110,7 @@ const { RemoteChromeWARCGenerator, RemoteChromeCapturer } = require('node-warc')
client.Network.enable(),
])
const cap = new RemoteChromeCapturer(client.Network)
cap.startCapturing()
await client.Page.navigate({ url: 'http://example.com' });
// actual code should wait for a better stopping condition, eg. network idle
await client.Page.loadEventFired()
@@ -121,15 +128,51 @@ const { RemoteChromeWARCGenerator, RemoteChromeCapturer } = require('node-warc')
})()
```

#### Using puppeteer
#### Using [chrome-remote-interface-extra](https://github.com/N0taN3rd/chrome-remote-interface-extra)
```js
const { CRIExtra, Events, Page } = require('chrome-remote-interface-extra')
const { CRIExtraWARCGenerator, CRIExtraCapturer } = require('node-warc')

;(async () => {
let client
try {
// connect to endpoint
client = await CRIExtra({ host: 'localhost', port: 9222 })
const page = await Page.create(client)
const cap = new CRIExtraCapturer(page, Events.Page.Request)
cap.startCapturing()
await page.goto('https://example.com', { waitUntil: 'networkIdle' })
const warcGen = new CRIExtraWARCGenerator()
await warcGen.generateWARC(cap, {
warcOpts: {
warcPath: 'myWARC.warc'
},
winfo: {
description: 'I created a warc!',
isPartOf: 'My awesome pywb collection'
}
})
} catch (err) {
console.error(err)
} finally {
if (client) {
await client.close()
}
}
})()
```

#### Using [Puppeteer](https://github.com/GoogleChrome/puppeteer)
```js
const puppeteer = require('puppeteer')
const { Events } = require('puppeteer')
const { PuppeteerWARCGenerator, PuppeteerCapturer } = require('node-warc')

;(async () => {
const browser = await puppeteer.launch()
const page = await browser.newPage()
const cap = new PuppeteerCapturer(page)
const cap = new PuppeteerCapturer(page, Events.Page.Request)
cap.startCapturing()
await page.goto('http://example.com', { waitUntil: 'networkidle0' })
const warcGen = new PuppeteerWARCGenerator()
await warcGen.generateWARC(cap, {
@@ -150,3 +193,5 @@ const { PuppeteerWARCGenerator, PuppeteerCapturer } = require('node-warc')
The generateWARC method used in the preceding examples is a helper function for making
the WARC generation process simple. See its implementation for a full example
of WARC generation using node-warc

Or see one of the crawler implementations provided by [Squidwarc](https://github.com/N0taN3rd/Squidwarc/tree/master/lib/crawler).
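
For orientation, the `winfo` fields passed to `generateWARC` in the examples above (`description`, `isPartOf`) end up in a `warcinfo` record. The following is a minimal sketch of how such a record can be laid out under the WARC 1.0 format. It is not node-warc's implementation: the function name `buildWarcInfoRecord` and its parameters are hypothetical and exist only for illustration.

```js
const { randomUUID } = require('crypto')

const CRLF = '\r\n'

// Hypothetical helper (not part of node-warc): builds one warcinfo record
// as a Buffer, assuming the WARC 1.0 layout of
//   version line, named header fields, blank line, content block, two CRLFs.
function buildWarcInfoRecord (filename, winfo, now = new Date()) {
  // The content block of a warcinfo record is a list of "name: value" fields
  const body = Object.entries(winfo)
    .map(([key, value]) => `${key}: ${value}${CRLF}`)
    .join('')
  const bodyBuf = Buffer.from(body, 'utf8')

  const header = [
    'WARC/1.0',
    'WARC-Type: warcinfo',
    // WARC 1.0 dates carry second precision, so drop the milliseconds
    `WARC-Date: ${now.toISOString().replace(/\.\d{3}Z$/, 'Z')}`,
    `WARC-Filename: ${filename}`,
    `WARC-Record-ID: <urn:uuid:${randomUUID()}>`,
    'Content-Type: application/warc-fields',
    // Content-Length counts bytes of the content block, not characters
    `Content-Length: ${bodyBuf.length}`
  ].join(CRLF)

  return Buffer.concat([
    Buffer.from(header + CRLF + CRLF, 'utf8'),
    bodyBuf,
    Buffer.from(CRLF + CRLF, 'utf8')
  ])
}

const record = buildWarcInfoRecord('myWARC.warc', {
  description: 'I created a warc!',
  isPartOf: 'My awesome pywb collection'
})
```

Returning a `Buffer` rather than a string keeps `Content-Length` honest when field values contain multi-byte characters.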
54 changes: 39 additions & 15 deletions index.d.ts
@@ -4,8 +4,8 @@ import {ReadStream} from "fs";
import {Gunzip} from "zlib";
import {Transform} from "stream";
import {URL} from 'url';
import {Page, Request, CDPSession} from "puppeteer";
import { EventEmitter } from "eventemitter3";
import * as puppeteer from 'puppeteer'

interface Error {
stack?: string;
@@ -216,22 +216,23 @@ export class ElectronRequestCapturer extends RequestHandler {

export class PuppeteerRequestCapturer {
_capture: boolean;
_requests: Request[];
constructor (page?: Page);
attach (page: Page): void;
detach (page: Page): void;
_requests: Map<number, puppeteer.Request>;
_requestC: number;
constructor (page?: puppeteer.Page, requestEvent?: string = 'request');
attach (page: puppeteer.Page, requestEvent?: string = 'request'): void;
detach (page: puppeteer.Page, requestEvent?: string = 'request'): void;
startCapturing(): void;
stopCapturing(): void;
requestWillBeSent(r: Request): void;
iterateRequests(): Iterator<Request>;
requests(): Request[];
[Symbol.iterator](): Iterator<Request>;
requestWillBeSent(r: puppeteer.Request): void;
iterateRequests(): Iterator<puppeteer.Request>;
requests(): puppeteer.Request[];
[Symbol.iterator](): Iterator<puppeteer.Request>;
}

export class PuppeteerCDPRequestCapturer extends RequestHandler {
constructor (client?: CDPSession);
attach(client: CDPSession);
detach(client: CDPSession);
constructor (client?: puppeteer.CDPSession);
attach(client: puppeteer.CDPSession);
detach(client: puppeteer.CDPSession);
}

export class RemoteChromeRequestCapturer extends RequestHandler {
@@ -240,6 +241,24 @@ export class RemoteChromeRequestCapturer extends RequestHandler {
detach(cdpClient: object);
}

export type CRIEPage = object
export type CRIERequest = object

export class CRIExtraRequestCapturer {
_capture: boolean;
_requests: Map<number, CRIERequest>;
_requestC: number;
constructor (page?: CRIEPage, requestEvent?: string = 'request');
attach (page: CRIEPage, requestEvent?: string = 'request'): void;
detach (page: CRIEPage, requestEvent?: string = 'request'): void;
startCapturing(): void;
stopCapturing(): void;
requestWillBeSent(r: CRIERequest): void;
iterateRequests(): Iterator<CRIERequest>;
requests(): CRIERequest[];
[Symbol.iterator](): Iterator<CRIERequest>;
}

export type WARCContentData = Buffer | string

export interface WARCFileOpts {
@@ -302,12 +321,12 @@ export class ElectronWARCGenerator extends WARCWriterBase {

export class PuppeteerWARCGenerator extends WARCWriterBase {
generateWARC (capturer: PuppeteerRequestCapturer, genOpts: WARCGenOpts): Promise<NullableEr>;
generateWarcEntry (request: Request): Promise<void>;
generateWarcEntry (request: puppeteer.Request): Promise<void>;
}

export class PuppeteerCDPWARCGenerator extends WARCWriterBase {
generateWARC (capturer: PuppeteerCDPRequestCapturer, client: CDPSession, genOpts: WARCGenOpts): Promise<NullableEr>;
generateWarcEntry (nreq: CDPRequestInfo, client: CDPSession): Promise<void>;
generateWARC (capturer: PuppeteerCDPRequestCapturer, client: puppeteer.CDPSession, genOpts: WARCGenOpts): Promise<NullableEr>;
generateWarcEntry (nreq: CDPRequestInfo, client: puppeteer.CDPSession): Promise<void>;
}

export class RemoteChromeWARCGenerator extends WARCWriterBase {
@@ -318,3 +337,8 @@ export class RemoteChromeWARCGenerator extends WARCWriterBase {
export class RequestLibWARCGenerator extends WARCWriterBase {
generateWarcEntry (resp: object): Promise<void>;
}

export class CRIExtraWARCGenerator extends WARCWriterBase {
generateWARC (capturer: CRIExtraRequestCapturer, genOpts: WARCGenOpts): Promise<NullableEr>;
generateWarcEntry (request: CRIERequest): Promise<void>;
}
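
The `PuppeteerRequestCapturer` and `CRIExtraRequestCapturer` declarations above share one shape: a capture flag, a `Map` of requests keyed by a counter, and `attach`/`detach` taking a page and an event name. Below is a rough sketch of a class with that shape, using Node's `EventEmitter` as a stand-in for a page. The class name `SketchRequestCapturer` is hypothetical, and details the declarations do not state (the initial value of `_capture`, and `startCapturing` clearing earlier requests) are assumptions rather than node-warc's confirmed behaviour.

```js
const { EventEmitter } = require('events')

// Hypothetical stand-in that mirrors the declared interface;
// not the implementation shipped in node-warc.
class SketchRequestCapturer {
  constructor (page, requestEvent = 'request') {
    this._capture = true // assumption: capturing is on by default
    this._requests = new Map()
    this._requestC = 0
    // Bind once so attach and detach register/remove the same listener
    this.requestWillBeSent = this.requestWillBeSent.bind(this)
    if (page) this.attach(page, requestEvent)
  }

  attach (page, requestEvent = 'request') {
    page.on(requestEvent, this.requestWillBeSent)
  }

  detach (page, requestEvent = 'request') {
    page.removeListener(requestEvent, this.requestWillBeSent)
  }

  startCapturing () {
    // assumption: starting a capture discards previously stored requests
    this._requests.clear()
    this._requestC = 0
    this._capture = true
  }

  stopCapturing () {
    this._capture = false
  }

  requestWillBeSent (r) {
    // Requests seen while capturing is off are dropped
    if (this._capture) this._requests.set(this._requestC++, r)
  }

  iterateRequests () {
    return this._requests.values()
  }

  requests () {
    return Array.from(this._requests.values())
  }

  [Symbol.iterator] () {
    return this._requests.values()
  }
}

// Usage: a plain EventEmitter plays the role of the page
const page = new EventEmitter()
const cap = new SketchRequestCapturer(page)
cap.startCapturing()
page.emit('request', { url: 'http://example.com/' })
cap.stopCapturing()
page.emit('request', { url: 'http://example.com/ignored' })
const capturedUrls = cap.requests().map(r => r.url)
```

In this sketch only the request emitted between `startCapturing()` and `stopCapturing()` is retained, which is why the README examples call `startCapturing()` before navigating.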
105 changes: 85 additions & 20 deletions index.js
@@ -7,6 +7,7 @@ const {
} = require('./lib/parsers')

const {
CRIExtraWARCGenerator,
ElectronWARCWriter,
PuppeteerCDPWARCGenerator,
PuppeteerWARCGenerator,
@@ -15,6 +16,7 @@ const {
} = require('./lib/writers')

const {
CRIExtraCapturer,
ElectronCapturer,
PuppeteerCapturer,
PuppeteerCDPCapturer,
@@ -23,30 +25,93 @@ const {
} = require('./lib/requestCapturers')

/**
* @type {{WARCStreamTransform: WARCStreamTransform, AutoWARCParser: AutoWARCParser, WARCGzParser: WARCGzParser, WARCParser: WARCParser, ElectronWARCWriter: ElectronWARCGenerator, PuppeteerCDPWARCGenerator: PuppeteerCDPWARCGenerator, PuppeteerWARCGenerator: PuppeteerWARCGenerator, RemoteChromeWARCWriter: RemoteChromeWARCGenerator, WARCWriterBase: WARCWriterBase, RequestHandler: RequestHandler, ElectronCapturer: ElectronRequestCapturer, PuppeteerCapturer: PuppeteerRequestCapturer, PuppeteerCDPCapturer: PuppeteerCDPRequestCapturer, RemoteChromeCapturer: RemoteChromeRequestCapturer}}
* @type {AutoWARCParser}
*/
module.exports = {
WARCStreamTransform,
AutoWARCParser,
WARCGzParser,
WARCParser,
ElectronWARCWriter,
PuppeteerCDPWARCGenerator,
PuppeteerWARCGenerator,
RemoteChromeWARCWriter,
WARCWriterBase,
RequestHandler,
ElectronCapturer,
PuppeteerCapturer,
PuppeteerCDPCapturer,
RemoteChromeCapturer
}
exports.AutoWARCParser = AutoWARCParser

/**
* @type {CRIExtraRequestCapturer}
*/
exports.CRIExtraCapturer = CRIExtraCapturer

/**
* @type {CRIExtraWARCGenerator}
*/
exports.CRIExtraWARCGenerator = CRIExtraWARCGenerator

/**
* @type {ElectronRequestCapturer}
*/
exports.ElectronCapturer = ElectronCapturer

/**
* @type {ElectronWARCGenerator}
*/
exports.ElectronWARCWriter = ElectronWARCWriter

/**
* @type {PuppeteerCDPRequestCapturer}
*/
exports.PuppeteerCDPCapturer = PuppeteerCDPCapturer

/**
* @type {PuppeteerCDPWARCGenerator}
*/
exports.PuppeteerCDPWARCGenerator = PuppeteerCDPWARCGenerator

module.exports.RequestLibWARCWriter = require('./lib/writers/requestLib')
/**
* @type {PuppeteerRequestCapturer}
*/
exports.PuppeteerCapturer = PuppeteerCapturer

/**
* @type {PuppeteerWARCGenerator}
*/
exports.PuppeteerWARCGenerator = PuppeteerWARCGenerator

/**
* @type {RemoteChromeRequestCapturer}
*/
exports.RemoteChromeCapturer = RemoteChromeCapturer

/**
* @type {RemoteChromeWARCGenerator}
*/
exports.RemoteChromeWARCWriter = RemoteChromeWARCWriter

/**
* @type {RequestHandler}
*/
exports.RequestHandler = RequestHandler

/**
* @type {WARCGzParser}
*/
exports.WARCGzParser = WARCGzParser

/**
* @type {WARCParser}
*/
exports.WARCParser = WARCParser

/**
* @type {WARCStreamTransform}
*/
exports.WARCStreamTransform = WARCStreamTransform

/**
* @type {WARCWriterBase}
*/
exports.WARCWriterBase = WARCWriterBase

/**
* @type {RequestLibWARCGenerator}
*/
exports.RequestLibWARCWriter = require('./lib/writers/requestLib')

if (require('./lib/parsers/_canUseRecordIterator')) {
/**
* @type {function(ReadStream|Gunzip): AsyncIterator<WARCRecord>}
* @type {function(ReadStream): AsyncIterator<WARCRecord>}
*/
module.exports.recordIterator = require('./lib/parsers/recordterator')
exports.recordIterator = require('./lib/parsers/recordterator')
}
32 changes: 23 additions & 9 deletions lib/parsers/index.js
@@ -1,17 +1,31 @@
/**
* @type {{AutoWARCParser: AutoWARCParser, WARCGzParser: WARCGzParser, WARCParser: WARCParser, WARCStreamTransform: WARCStreamTransform, GzipDetector: GzipDetector}}
* @type {AutoWARCParser}
*/
module.exports = {
AutoWARCParser: require('./autoWARCParser'),
WARCGzParser: require('./warcGzParser'),
WARCParser: require('./warcParser'),
WARCStreamTransform: require('./warcStreamTransform'),
GzipDetector: require('./gzipDetector')
}
exports.AutoWARCParser = require('./autoWARCParser')

/**
* @type {GzipDetector}
*/
exports.GzipDetector = require('./gzipDetector')

/**
* @type {WARCGzParser}
*/
exports.WARCGzParser = require('./warcGzParser')

/**
* @type {WARCParser}
*/
exports.WARCParser = require('./warcParser')

/**
* @type {WARCStreamTransform}
*/
exports.WARCStreamTransform = require('./warcStreamTransform')

if (require('./_canUseRecordIterator')) {
/**
* @type {function(warcStream: ReadStream|Gunzip): AsyncIterator<WARCRecord>}
*/
module.exports.recordIterator = require('./recordterator')
exports.recordIterator = require('./recordterator')
}