How to Extract All Google Code Project Names with a Given Label

Google Code will be closed entirely on January 25, 2016, so I want to export some of the projects I am interested in to GitHub.

I am interested in PDF-related projects. I checked online first: the search returns 950 projects, which means 95 pages of results. I don't want to click through every page and every link to get the project names, so I wrote a simple shell script to help me.

First download all result pages to local disk.

for i in {0..940..10}; do wget "https://code.google.com/hosting/search?q=label%3Apdf&filter=0&mode=&start=$i" -O $i.html ;done
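
The loop bounds come straight from the result count: 950 results at 10 per page gives start offsets 0, 10, ..., 940. As a minimal sketch, the offsets can be derived instead of hard-coded. Here `page_offsets` is a hypothetical helper name introduced for illustration, and the page size of 10 is taken from the search above.

```shell
#!/bin/sh
# page_offsets TOTAL PER_PAGE  (hypothetical helper, not part of the original script)
# Prints the "start" value of every result page, one per line.
page_offsets() {
  total="$1"
  per_page="$2"
  # Nothing to print when there are no results.
  [ "$total" -gt 0 ] || return 0
  # The last page starts at the largest multiple of per_page below total.
  last=$(( (total - 1) / per_page * per_page ))
  seq 0 "$per_page" "$last"
}

# Usage (950 results, 10 per page):
# for i in $(page_offsets 950 10); do
#   wget "https://code.google.com/hosting/search?q=label%3Apdf&filter=0&mode=&start=$i" -O "$i.html"
# done
```

This also avoids the `{0..940..10}` step syntax, which needs a reasonably recent bash.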

Then grab all project names and combine them with the export URL.

grep '<a href="/p/[a-z0-9-]*/">' *.html|grep -v '<a href="/p/support/">Project Hosting Help</a>'|sed 's/.*\/p\/\([^\/">]*\).*/https:\/\/code.google.com\/export-to-github\/export?project=\1/g'
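
To make the pipeline easier to reuse and check, here is a sketch of a variant wrapped in a function. It is not the exact command above: it uses `grep -o` to keep only the matching link, `-h` to drop file name prefixes, and `sort -u` to remove duplicates. The function name `extract_export_urls` is hypothetical, and the pattern assumes, like the original, that project names contain only lowercase letters, digits, and hyphens.

```shell
#!/bin/sh
# extract_export_urls FILE...  (hypothetical helper name)
# Reads saved search-result pages and prints one export URL per project.
extract_export_urls() {
  # -h: no file name prefix; -o: print only the matched project link.
  grep -ho '<a href="/p/[a-z0-9-]*/">' "$@" |
    # Drop the "Project Hosting Help" link, which is not a real project.
    grep -v '"/p/support/"' |
    # Keep the project name and build the export URL around it.
    sed 's|.*/p/\([^/"]*\)/.*|https://code.google.com/export-to-github/export?project=\1|' |
    # A project may be linked more than once; list it only once.
    sort -u
}

# Usage: extract_export_urls *.html > export-urls.txt
```

Because only the matched link is passed on, the sed expression no longer depends on what else appears on the same line of HTML.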

You can also filter the results further by description, label, last update, and number of stars.

If you want to automate the export jobs themselves, you will need help from a tool such as CasperJS.

By the way, you can also run this script under Windows. If you have MSYS, you do not need to modify it at all; otherwise, you need to download the Windows versions of wget, grep, and sed and make some modifications.

P.S.
This script works with any kind of Google Code search, so feel free to give it a try.
