Parser for large XML files (specifically Wiktionary XML dumps) which converts them to SQLite database files.

About this project

This project parses large XML files (and by "large" I really mean large, as in a Wiktionary XML dump), storing cherry-picked contents in a SQLite database that is split across one or more parts of configurable size. It uses stream-based reading (createReadStream) and a worker thread for parallel processing, reducing both I/O load and CPU load.
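As a rough, self-contained sketch of the stream-plus-worker handoff described above (the message shapes and the inline "parser" are illustrative, not this project's actual code):

```javascript
// Illustrative sketch of the worker-thread handoff; in the real project the
// chunks come from createReadStream(WIKI_DUMP_XML), not from an inline array.
import { Worker } from "node:worker_threads";

// Inline worker that "parses" each chunk (here it just counts <page> openings)
const worker = new Worker(
    `const { parentPort } = require("node:worker_threads");
     let pages = 0;
     parentPort.on("message", (msg) => {
         if (msg.type === "chunk") {
             pages += (msg.data.match(/<page>/g) ?? []).length;
         } else if (msg.type === "done") {
             parentPort.postMessage(pages);
             parentPort.close();
         }
     });`,
    { eval: true }
);

worker.on("message", (pages) => console.log("pages seen:", pages)); // → pages seen: 2

// The main thread stays free to keep reading while the worker processes chunks
for (const chunk of ["<page>water</page>", "<page>fire</page>"]) {
    worker.postMessage({ type: "chunk", data: chunk });
}
worker.postMessage({ type: "done" });
```

The point of the split is that the main thread only shuttles bytes, while tokenizing/parsing cost lands on the worker.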

Installation and usage

  1. Clone this repo.
  2. Install the dependencies with npm install.
  3. Create the .env file (see .env.example), pointing the project to your downloaded Wiktionary XML dump, the path where each SQLite database part will be created, the filename prefix for each part, and the maximum size of each part.
  4. Run the main process with node index.mjs.
  5. Make yourself a cup of coffee or tea, take a walk outside to touch some grass or whatever, because processing a 12 GiB XML will probably take hours.

Environment variables

  • WIKI_DUMP_XML: Path to the XML file (a Wiktionary XML dump) to be processed.
  • DB_PATH: Path to the directory where the SQLite database parts will be stored.
  • DB_BASENAME: Filename prefix for each SQLite database file.
  • DB_SIZE: Maximum size for each SQLite database file (note that #iB != #B; more below).
  • IGNORE_IDS: (Optional) A plain-text file with IDs (the internal Wiktionary page IDs) to be ignored during processing, one per line.
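For instance, a .env might look like this (all values here are hypothetical; adapt the paths to your setup and check .env.example):

```shell
# Hypothetical example values — adjust to your own paths (see .env.example)
WIKI_DUMP_XML=/data/enwiktionary-pages-articles.xml
DB_PATH=./db
DB_BASENAME=wikidump
DB_SIZE="512 MiB"
IGNORE_IDS=./ignore-ids.txt
```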

Byte-size format

File sizes can be expressed in three different formats:

  • A pure number of bytes: the most straightforward way; just count how many bytes you need. Floating-point numbers aren't accepted. Examples of valid inputs: "42", "1048576", or any other natural (positive integer) number.
  • Natural number followed by a unit: accepted units are B (bytes) as well as its SI multiples (powers of 1000: KB, MB, GB, TB and PB, where the "B" is optional) and binary multiples (powers of 1024: KiB, MiB, GiB, TiB and PiB, where the "iB" suffix is required). Optionally, the number and the unit can be separated by a space. Examples of valid inputs: "42 B", "616 MiB", "555GB", "777G", etc.
  • Floating-point number followed by a unit (except B): like the previous format, except that "B" is not accepted as a unit for floating-point numbers (the project deals only with whole bytes, not bits, so fractional bytes would be pointless here). The decimal point must have digits on both sides (so inputs such as ".42 KB" or "42. KB" aren't accepted, but "0.42 KB" and "42.0 KB" are). Examples of valid inputs: "7.47MB", "12.3 GiB", etc.
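A hypothetical sketch of a parser for these three formats (the project's actual implementation may differ; the function and constant names here are made up):

```javascript
// Hypothetical size-string parser following the rules above; not the project's
// actual code. SI prefixes are powers of 1000, binary prefixes powers of 1024.
const SI  = { K: 1e3,   M: 1e6,   G: 1e9,   T: 1e12,  P: 1e15 };
const BIN = { K: 2**10, M: 2**20, G: 2**30, T: 2**40, P: 2**50 };

function parseByteSize(input) {
    // number, optional space, then either a prefixed unit or a bare "B"
    const m = /^(\d+(?:\.\d+)?) ?(?:([KMGTP])(iB|B)?|B)?$/.exec(input.trim());
    if (!m) return null;                           // e.g. ".42 KB", "42. KB"
    const [, num, prefix, suffix] = m;
    if (num.includes(".") && !prefix) return null; // floats need a multiple unit
    if (!prefix) return Number(num);               // "42" or "42 B"
    const mult = suffix === "iB" ? BIN[prefix] : SI[prefix];
    return Math.round(Number(num) * mult);
}

console.log(parseByteSize("616 MiB")); // → 645922816
console.log(parseByteSize("0.42 KB")); // → 420
console.log(parseByteSize(".42 KB"));  // → null
```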

Dependencies

  • chalk: Fancy colors for (certain) terminal outputs (currently, just for error redness).
  • dotenv: Loads .env, given that my current Node.js version is a bit old.
  • fast-xml-parser: Amazing XML parsing library. No, it won't parse the entire 12 GB XML in one go (that would be asking too much of it, and of your own computer); the XML needs to be broken into pieces first (and that's the first purpose of this project).
  • sqlite: Main SQLite driver for Node.js.
  • sqlite3: Auxiliary library meant to be used with the SQLite driver.

Test scripts

  • check_db_counts.sh: a shell script left over from my iterative development. It simply outputs the total number of rows across all SQLite database parts, using the sqlite3 program to query the count for each file within the db folder, and awk to reduce these counts to a sum. Usage: after running the project and populating the db folder, run cd tests; chmod +x check_db_counts.sh; ./check_db_counts.sh.
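If you want to reproduce it by hand, the script boils down to something like this (the "pages" table name and the ../db path are assumptions for illustration, not necessarily what the script actually uses):

```shell
#!/bin/sh
# Rough equivalent of check_db_counts.sh: query each part, then sum with awk.
# The "pages" table name and ../db path are assumptions for illustration.
for f in ../db/*.sqlite; do
    [ -e "$f" ] || continue                      # skip if no parts exist yet
    sqlite3 "$f" "SELECT COUNT(*) FROM pages;"
done | awk '{ sum += $1 } END { print sum + 0 }'
```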

Limitations

Currently, this project:

  • Creates and writes fresh SQLite database files, without reusing pre-existing ones (which is the reason IGNORE_IDS exists)
  • Expects the tags found in Wiktionary XML dump files (particularly <namespaces>, <namespace>, <pages> and <page>)
  • Ignores Wiktionary pages whose namespace is neither the main (empty) namespace nor "Category"

As per Murphy's Law, expect occasional specific failures that will need a custom, specific fix: this project deals with really big files whose exact contents are unknown beforehand, and possibly broken at the source (e.g., when a random bot edits a random Wiktionary page and breaks its contents into multiple #text fragments, defeating fast-xml-parser's expected behavior of a single #text key within parsed.revision.text).

Should you decide to use it and/or have other needs, feel free to adapt the code (see details below).

Disclaimer and licensing

Wiktionary is a product from Wikimedia Foundation. This project is neither affiliated with Wiktionary nor with Wikimedia Foundation.

This project doesn't ship with Wiktionary files, nor does it download them. It's up to you to download any Wiktionary dump files, and any extractions or derivatives that YOU create using this tool with XML files downloaded from Wiktionary are subject to Wiktionary's licensing terms.

Regarding this project's code, I took the liberty of choosing the WTFNMFPL license (see LICENSE), a fork of the original WTFPL license: do whatever you want (but it's not my fault).
