User:Visviva/Bash
I'm fairly new to Bash, but if these scripts are of any use to you, please feel free to use and adapt them.
If you think you can improve anything on this page, please share your ideas either here or on the Talk page.
Uncat.sh
I find that this script only processes about 300,000 lines per hour on my desktop machine. At that rate it would take about 400 hours to process the entire text of Wikipedia.
#!/bin/bash
# This is a bash script for finding uncategorized pages in an English
# Wikipedia XML dump. It prints a wiki link for each page that contains
# no category tag.
# The script takes one argument: the name of the dump file it will process.
# If you know a way to make this script faster, please share.

# Open the dump file on a dedicated file descriptor
exec 3< "$1"
in=0
# Start
while IFS= read -r line <&3; do
    # Scan for category tags (and other markers that exempt a page)
    if [ "$in" -eq 1 ]; then
        case $line in
            *'[[Category:'* | *'[[category:'* | *REDIRECT* | *redirect* | \
            *disambig* | *'dis}}'* | *'CC}}'* | *Disambig* | *Redirect* )
                in=0;;
        esac
    fi
    # Scan for a title -- this also tells us the last page is over
    case $line in
        *'<title>'*)
            oldtitle=$PAGE_TITLE
            title=$(echo "$line" | sed -e 's@.*<title>\(.*\)</title>.*@\1@')
            export PAGE_TITLE=$title
            # The previous page ended without a category tag: report it
            if [ "$in" -eq 1 ]; then
                echo "*[[$oldtitle]]"
            fi
            in=1
            # Skip pages whose title mentions deletion
            case $title in
                *deletion* | *Deletion* )
                    in=0;;
            esac
            ;;
    esac
done
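
To run it over a dump and collect the results, something like the following should work (the file names here are only examples; substitute your own):

bash uncat.sh enwiki-pages-articles.xml > uncategorized.txt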
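
On the speed question: most of the time probably goes to forking grep and sed for every line of the dump, so doing the whole scan in a single awk process should be much faster. Below is an untested sketch of that idea; it is meant to mirror the logic of the loop above (including never printing the final page in the file, which the loop version also never reaches), but I have not benchmarked it against a full dump.

#!/bin/bash
# Sketch of a faster uncat.sh: one awk process instead of a grep
# and a sed fork per line. Takes the dump file as its argument.
awk '
/<title>/ {
    # A new page starts; if the previous one never hit a category
    # marker, print it as an uncategorized page.
    if (in_page)
        print "*[[" title "]]"
    title = $0
    sub(/.*<title>/, "", title)
    sub(/<\/title>.*/, "", title)
    # Mirror the original: skip pages whose title mentions deletion
    in_page = (title ~ /[Dd]eletion/) ? 0 : 1
    next
}
in_page && /\[\[[Cc]ategory:|REDIRECT|redirect|disambig|dis}}|CC}}|Disambig|Redirect/ {
    in_page = 0
}
' "$1"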

