Thursday, 30 April 2009

Bulk uploading to App Engine is faster than you think

One of the things that really worried me when I started porting the nkill project to App Engine was the speed at which I could upload data to the App Engine datastore.

I kept seeing threads indicating that it was a slow and painful process.
Luckily, bulkloader.py isn't bad at all! I suspect that the bottleneck is upload speed. Home users typically have asymmetric bandwidth: the download speed is significantly higher than the upload speed, which is often capped at 256-512 kbit/s.

Here are some stats:

2332970 entities in 21112.7 seconds (that's 2.3M in just under 6 hours, or about 110 entities per second).
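As a quick sanity check, the figures above can be reworked in a few lines of Python. This is only arithmetic on the numbers quoted in this post; the projection at the end is an assumption (it supposes the rate stays constant and borrows the 102,359,087 domain count quoted elsewhere on this blog), not a measurement.

```python
# Figures quoted in the post.
entities = 2332970
seconds = 21112.7

rate = entities / seconds      # entities per second
hours = seconds / 3600.0       # wall-clock hours

print("%.1f entities/s over %.2f hours" % (rate, hours))

# Hypothetical projection: a much larger set uploaded at the same rate.
remaining = 102359087
days = remaining / rate / 86400.0
print("%.1f days for %d entities" % (days, remaining))
```

At this rate the full domain set would take on the order of days rather than hours, which is why tuning the upload options matters.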

I split my input CSV into 10,000-line files and used the following bulkloader.py (SDK 1.2.0) options:

--rps_limit=250 --batch_size=50
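The post doesn't say how the CSV was split, so the following is only a minimal sketch of one way to do it. The function name `split_csv` and the `chunk_00000.csv` naming scheme are made up for illustration, and the code assumes one record per line (no quoted fields containing newlines) and no header row.

```python
import os

def split_csv(path, out_dir, lines_per_file=10000):
    """Copy `path` into consecutive chunk files of at most `lines_per_file` lines.

    Returns the list of chunk paths, in order. Hypothetical helper, not part
    of the App Engine SDK.
    """
    paths = []
    out = None
    with open(path) as src:
        for i, line in enumerate(src):
            if i % lines_per_file == 0:
                # Start a new chunk every `lines_per_file` lines.
                if out is not None:
                    out.close()
                name = os.path.join(out_dir, "chunk_%05d.csv" % (i // lines_per_file))
                out = open(name, "w")
                paths.append(name)
            out.write(line)
    if out is not None:
        out.close()
    return paths
```

Calling `split_csv("domains.csv", "chunks")` (both names are placeholders, and the output directory must already exist) would produce files that can each be handed to the loader separately, so a failed upload only has to be retried for one small file.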

I am pretty sure there is room for improvement. I tried to use conservative values to minimize CPU usage and stay under the quota radar (I still managed to get in the red).

The following parameters will affect the speed at which you can upload:

--batch_size= (max 500)
--num_threads= (default 10)
--rps_limit= (default 20)
--http_limit= (default 8)

I'll do a follow-up post since I have several million records to upload. Hopefully I'll find the sweet spot.

K.

http://groups.google.com/group/google-appengine/browse_thread/thread/7c145b66ca06aff1

Tuesday, 28 April 2009

Getting started with nkill beta

  1. Read the slides;
  2. Request an invite for the beta;
  3. Check out the blog;
  4. Follow nkill on Twitter;
  5. Join the group.


Monday, 27 April 2009

Resolving DNS records

We need to resolve all www.DOMAIN and DOMAIN A records, MX records and NS A records.

102,359,087 domains... the set is currently limited to .com, .net and .org. The www.DOMAIN run just finished:


real 931m10.251s
user 722m37.686s
sys 123m5.790s


Not too bad! That's about 110K queries per minute on a single box. Lots of timeouts though. The queries are sent to a local DJB dnscache. In the future, I'll probably split the resolve jobs between multiple machines at different ISPs.
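For anyone who wants to check the rate, here is a small sketch that derives it from the `time` output above. The helper `parse_time` is made up for this example, and the calculation assumes the run issued exactly one query per domain, which the post implies but doesn't state.

```python
import re

def parse_time(value):
    """Convert a `time`-style duration such as '931m10.251s' to seconds."""
    m = re.match(r"^(\d+)m([\d.]+)s$", value)
    if m is None:
        raise ValueError("unexpected duration: %r" % value)
    return int(m.group(1)) * 60 + float(m.group(2))

domains = 102359087            # size of the .com/.net/.org set
real = parse_time("931m10.251s")
user = parse_time("722m37.686s")
sys_time = parse_time("123m5.790s")

per_minute = domains / (real / 60.0)    # queries per minute (one per domain assumed)
cpu_share = (user + sys_time) / real    # fraction of wall-clock time spent on CPU

print("%.0f queries/min, CPU busy %.0f%% of the run" % (per_minute, cpu_share * 100))
```

The CPU share is worth a glance before adding more machines: a value close to 100% would suggest the box itself, rather than the network, was the limit.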

Sunday, 26 April 2009

NKill in PC World

Sumner Lemon wrote a piece on NKill entitled: "NKill Aims to Catalog Vulnerabilities of Every Computer".

The article is a bit ambiguous and I should clarify:

One of NKill's objectives is to catalog every referenced public machine or network. It starts with all .com, .net and .org domains, www.DOMAIN, mail exchange records, nameservers, etc., and grabs the version banners of the software they are running.

NKill will be really useful for profiling a target during a security assessment because IPv4 transforms are hard to perform without a database. Given an IPv4 address, shitty sites like domaintools will tell you which virtual hosts are sharing the same address and nothing more, and they will charge you a fee for that information. They won't tell you which organisations (domains) are trusting this IP address for their mail, nameservers, etc.

With NKill, when a new vulnerability is discovered (e.g. in IIS, postfix, apache, php...) we can instantly know which domains are vulnerable. You can pull that information for a whole country, and we can also monitor how long it takes for people to react and patch their boxes.

Roberto from Zone-H told me I am going to make a lot of new friends with this project. I guess I'll have to hire some of our troll moderators from k.com.

Saturday, 25 April 2009

NKill design by Matt Buchanan

Just wanted to make a quick post to give credit to Matt Buchanan for the nkill spider and logo.

This is the third design project I am doing with Matt. Matt did our logo banner for the kugutsumen forums; he also did the Flingtech logo. I contacted Matt after his series of exclusive LBP stickers was released on Little BIG Planetoid.

Matt is an Illustrator / Designer / Web Designer (specializing in character design) from Brighton in the south of the U.K.: www.mattbuchanan.co.uk.


Here is an early sketch:

NKill Presentation Slides from HITB 2009 Dubai


HITB 2009 Conference materials have been uploaded. You can grab my NKill presentation slides here.

You might also want to check out Saumil's and grugq's and Chris Evans' awesome presentations.

Thanks again to Dhillon and the HITB crew!

Issues with Datastore usage


I have 102 million domains plus tons of MX records and A records ready to upload to App Engine as soon as this issue is cleared up:

Datastore usage ~ 80 times more than expected:


I also think there is something wrong.

I have 2.3M Domain records and the source CSV is only 63 megabytes,
no composite index. The dashboard claims I am using 3GB !?!
(3.03 of 101.00 GBytes)

This is my base expando model:

from google.appengine.ext import db

class Domain(db.Expando):
    name = db.StringProperty(required=True, verbose_name='FQDN')
    revname = db.StringProperty(verbose_name='Reverse FQDN')
    since = db.DateTimeProperty(auto_now_add=True)

I am ready to upload 102M more records, but I guess I am going to wait
until this issue is resolved.

Kugutsumen
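To put the numbers from that post in per-record terms, here is a rough back-of-the-envelope calculation. It uses only the figures quoted above and assumes decimal units (1 MB = 10**6 bytes, 1 GB = 10**9 bytes); the dashboard may count differently, so treat the results as approximate.

```python
# Figures quoted in the post; decimal units are an assumption.
entities = 2.3e6            # Domain records uploaded
csv_bytes = 63e6            # size of the source CSV
datastore_bytes = 3.03e9    # usage reported by the dashboard

csv_per_entity = csv_bytes / entities           # bytes per record in the CSV
stored_per_entity = datastore_bytes / entities  # bytes per record in the datastore
blowup = datastore_bytes / csv_bytes            # datastore size relative to the CSV

print("CSV: %.0f bytes/record, datastore: %.0f bytes/record, ratio: %.0fx"
      % (csv_per_entity, stored_per_entity, blowup))
```

That works out to well over a kilobyte stored for a record that takes a few dozen bytes in the CSV, which is the gap the post is complaining about.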