How to import Wikipedia data dumps into MySQL

Installing xml2sql

Download xml2sql-0.5.tar.gz from the Data dumps/xml2sql page.

tar xvfz xml2sql-0.5.tar.gz
cd xml2sql-0.5
./configure
make
sudo make install
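
make install puts the xml2sql binary under the configured prefix (by default /usr/local/bin). A quick sanity check that it is on your PATH:

which xml2sql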

Creating a MediaWiki database

Download MediaWiki.
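
For example, with wget (this URL assumes the releases.wikimedia.org layout for 1.23.x releases):

wget https://releases.wikimedia.org/mediawiki/1.23/mediawiki-1.23.3.tar.gz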

tar xvfz mediawiki-1.23.3.tar.gz
mysql -uNAME -p DATABASE < mediawiki-1.23.3/maintenance/tables.sql
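
Note that tables.sql does not create the database itself; if DATABASE does not exist yet, create it first. utf8 here matches the character set used at import time below:

mysql -uNAME -p -e "CREATE DATABASE DATABASE DEFAULT CHARACTER SET utf8;"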

Importing the Wikipedia data dump

Download latest/jawiki-latest-pages-articles.xml.bz2 from the Index of /jawiki/ page on dumps.wikimedia.org.
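
For example (the path follows the standard dumps.wikimedia.org layout):

wget https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2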

bzip2 -d jawiki-latest-pages-articles.xml.bz2

The XML now has to be converted into an importable format with xml2sql, but running it as-is fails with an unexpected element <foo> error:

xml2sql jawiki-latest-pages-articles.xml

xml2sql does not recognize these newer schema elements, so strip the lines that contain them with grep -v:

cat jawiki-latest-pages-articles.xml \
  | grep -v "<dbname>"   | grep -v "<dbname.*/>" \
  | grep -v "<ns>"       | grep -v "<ns.*/>" \
  | grep -v "<parentid>" | grep -v "<parentid.*/>" \
  | grep -v "<sha1>"     | grep -v "<sha1.*/>" \
  | grep -v "<model>"    | grep -v "<model.*/>" \
  | grep -v "<format>"   | grep -v "<format.*/>" \
  | grep -v "<redirect>" | grep -v "<redirect.*/>" \
  | xml2sql
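
If you would rather not keep the huge decompressed XML around, the same filtering can be streamed straight out of the archive. This is a sketch of an equivalent pipeline, assuming each of these elements sits on its own line in the dump (which the grep -v approach above relies on as well); a single grep -Ev replaces the fourteen greps:

bzcat jawiki-latest-pages-articles.xml.bz2 \
  | grep -Ev "<(dbname|ns|parentid|sha1|model|format|redirect)[ >/]" \
  | xml2sql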

The long list of grep conditions can be generated with Ruby:

# xml2sql.rb
# Elements in the dump that xml2sql does not recognize.
arr = ["dbname", "ns", "parentid", "sha1", "model", "format", "redirect"]

# Build the shell pipeline: for each element, drop both the
# <elem>...</elem> lines and the self-closing <elem ... /> lines.
str = "cat jawiki-latest-pages-articles.xml |"
arr.each do |element|
  str += " grep -v \"<#{element}>\" | grep -v \"<#{element}.*/>\" |"
end
str += " xml2sql"
puts str
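
Run the script and pipe its output to a shell to execute the generated command (or simply copy and paste it):

ruby xml2sql.rb | sh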

xml2sql produces page.txt, text.txt, and revision.txt. Import them into MySQL with mysqlimport, which derives the target table name from each file's basename:

mysqlimport --fields-terminated-by="\t" --default-character-set=utf8 -uNAME -dLp DATABASE page.txt
mysqlimport --fields-terminated-by="\t" --default-character-set=utf8 -uNAME -dLp DATABASE text.txt
mysqlimport --fields-terminated-by="\t" --default-character-set=utf8 -uNAME -dLp DATABASE revision.txt
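
Here -d empties each table before loading, -L reads the files from the client host, and -p prompts for the password. As a quick sanity check that rows actually arrived (the count will vary with the dump):

mysql -uNAME -p DATABASE -e "SELECT COUNT(*) FROM page;"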

That completes the import.