UROP Goals

I hope the following will give you an insight into what I’m studying and where I aim to go with it.

Overall Goal for the next 10 weeks: To be of use the Signposts Project and learn some useful skills on the way

Skill Required:

  1. A solid knowledge of computer networking (in particular protocols at the network and transport layers)
  2. Familiarly and practical experience with Ocaml and (in particular Lwt and using sockets)
  3. Experience with Java on Android and the platform

How I hope to achieve a solid knowledge of computer networking ?

  • review the material from my computer networking course, mostly by re-reading Kurose, J.F. & Ross, K.W. (2009). Computer networking: a top-down approach. Addison-Wesley (5th ed.), redoing past exam questions / supervision work, doing the Wireshark Lab sheets and working through the hands on material
  • Gain a deeper knowledge of particular areas by reading RFC and academic papers, ensuring that I alway take notes so that its easy to refer back to material without re-reading all of the material
  • Get update with the latest in online anonymity and security including projects/topics such as Tor, FreedomBox, Privoxy, Hamachi, Tails, Backtrack, SSL, DNSSEC, FireFox addons like NoScript and HTTPS Everywhere etc…

How I hope to achieve familiarly and practical experience of Ocaml

  • Working through Unix system programming in ocaml by Xavier Leroy and Didier Rémy
  • Keeping upto date with Ocaml project & questions on StackOverflow
  • Build Ocaml code

RFC 1034 – An insight into the 1980’s

If you think that RFC make dry reading, maybe just maybe you just haven’t given them enough of a chance yet. Today, I begin reading my first RFC, I choose RFC 1034 and I can genuinely say that I’ve found much more interesting than I was expect, in fact I’ve struggled to put it down (though I suspect that says more about me than RFCs) . RFC 1034 was written 1987 (before I was even born) and offers an insight to a world before DNS and how its designed envisioned DNS’s future use. Below are my notes on the RFC so far (this is likely to be the first collection of notes out of 3). Sorry for the bullets but its my preferred way to take notes

–>

I’ve located the Request for comments (RFC) relating to DNS from the following page http://www.zoneedit.com/doc/rfc/. The RFC that I am going to begin with reading is RFC 1034. This RFC is developed in November 1987 and it titled “Domain Names – Concepts & Facilities”

–>

Historical Context

  • Originally per host file in /etc/hosts for hostname to IP address mapping was updated via FTP.
  • The bandwidth required to update this per host file was proportional to number of host squared, hence exponential grow of hosts mean new system was required
  • The first operational packet switching network was the Advanced Research Projects Agency Network (ARPANET)
  • When changes to the per host file were make, they took along time to propagate to all hosts

Common Ideas from Proposed Solutions

  • Hierarchical name space with the hierarchy corresponding to organisations structures
  • Using a full stop “.” to mark the boundary between hierachy levels

DNS Design Goals

  • Consistent name space where names do not include network identifiers or addresses
  • Distributed management with local caching
  • Minimising attempts to collect consistent copies of entire database as this is not scalable
  • Source (not client) control tradeoff between cost, speed and accuracy of caches
  • All data associated with a name is tagged with a type
  • Name server translations should be independent of communication system that carriers them
  • The system should be useful across a wide spectrum of hosts

Usage Assumptions

  • Initially total database size is proportional to number of hosts but this will grow in time
  • System wide data will change slowly, subnet wide data will change rapidly
  • Each organisation is responsible for its own domains including providing redundant name servers
  • Clients use trusted name server before name server outside the “trusted set”
  • Availability is more important than consistency so its ok that updates are not simultaneous
  • Believe old data when new data is not yet avaliable
  • Copies of translations are distributed with time-outs for refreshing, the distributor sets time-out value
  • When a name server is presented with a query that can only be answered by some other server, either a recursive (name server asks another server to do query) or iterative (name server returns address of another name server to ask instead). The iterative approach is preferred but the recursive approach is still an option
  • All data originates from master files keep on hosts that use the DNS
  • All master files are updated by local sysadmin and all master files are text files that are read by a local name server
  • User programs access name servers through standard programs called resolvers
  • The standard format of master files allows them to be exchanged between hosts
  • Sysadmin provides zone boundaries, master files of data, updates of master fules and refreshing policies
  • DNS provides standard formats for resource data, standard method of querying the database, standard methods for name server to refresh local data from foreign name servers

DNS & Resource Records (RR)

  • The names in the name space will have a tree structure and RR hold the data associated with each name
  • Each node and leaf on the DNS tree has a set of information associated with it and a DNS query aims to extract this information
  • A DNS query includes the name of interest and the type of information that is required

Name Servers

  • These are server programs which hold information about the DNS tree structure and information set
  • A name server may cache information on DNS tree structure or information associated with particular names
  • In general, a name server has complete information from a subset of the DNS tree leaves and then pointers to other name server that can be used to get information on the other areas of the DNS tree
  • A name server has authority for the area of the DNS tree for which it has complete information
  • Authotitative information is organized into units called zones

Resolvers

  • Resolvers are programs that extract information from the name servers in response to client queries
  • All resolvers must have access to at least one name server
  • Typically a resolver is a system routine that is accessible to user programs

DNS from a client prospective

  • DNS is accessed via a procedure call to the local resolver
  • The DNS consists of a single tree and the user can request information from any section of the tree

DNS from the resolvers prospective

  • DNS is composed of an unknown number of name servers
  • Each name server hold at least one piece of information from the DNS tree
  • The DNS tree is static

DNS from the name servers prospective

  • DNS is divided into zones
  • The name server has local copies of some zones
  • The name server must periodically refresh its zones from master copies in local files or foreign name server

Name Specification

  • The term “node” is used to refer to both interior nodes and leaves
  • Each node has a resource set, though this can be empty
  • Each node has a label, 0 to 63 octets (an octet is 8 bits) in length
  • Brother nodes may not have the same label
  • The same label can be used by nodes that are not brothers
  • The root label is null
  • The domain name of a node (not the same as the label for a node) is the list of labels on the path from the node to the tree root
  • The label that compose the domain name are read from (most specific) left to (most general) right, e.g. consider the structure of the DNS name for college computers as342.priv.queens.cam.ac.uk
  • Domain names are case-insenstive
  • In domain names the labels are separated by dots (.)
  • A domain name ending in a dot represents a absolute/complete domain name
  • A domain name not ending in a dot represents an incomplete domain name that should be completed by local software (aka relative domain name)
  • The maximum domain name length is 255 octets (set of 8 bits), this includes labels and label lengths
  • A domain is identified by a domain name and consists of the tree at and below that domain name
  • A domain is a sub domain of another domain if it is contained within that domain
  • You can test if a domain is a sub-domain of another, if the domain name is the within the end of the sub-domain e.g. cam.ac.uk is a sub-domain of uk and queens.cam.ac.uk is a subdomain of ac.uk
  • Labels must follow the rules for ARPANET host names

Email

  • The usually format of email addresses is localpart@maildomain, this is mapped to a domain name so that it can be looked up in the domain name tree by converting localpart into a single label (ignoring dots) then converting maildomain into a domain name (not ignoring dots) and replacing @ with a dot to form a domain name so the email address as342@cam.ac.uk will become as342.cam.ac.uk
  • Domain names ( and therefore email addresses) are case insensative

Resource Records (RR)

  • Each node in the tree has a set of resource information, this can be empty
  • The set of resource information is composed of the set of Rrs
  • Order of RRs is not significant and need not be preserved
  • The owner of RR is the domain name where the RR is found, this is normally inplicit rather than being an explicit part of a RR
  • The type of a RR is an encoded 16 value such as:
A – a host address
CNAME – the canonical name of an alias
HINFO – CPU & OS used by a host
MX – mail exchange for domain
NS – authoritative name server for the domain
PTR – pointer to another part of domain name space

SOA – start of zone of authority

  • The class of a RR in an encoded 16 bit value such as IN – internet system or CH – Chaos System
  • The TTL of a RR is the time to live of the RR, this is a 32 bit integer in seconds, this defines how long a RR can be cached for before discarding, this limit does not apply to authoritative data in zones (TTL of 0 will prevent caching)
  • The RDATA of an RR contains different values depending on the type of class of the RR such as:
  • If type is A and class is IN, then RDATA is a 32 bit IP address
  • If type is A and class is CH, then RDATA is a domain name followed by a 16 bit octal Chaos address
  • If type is CNAME, then RDATA is a domain name
  • If type is MX, then RDATA is a 16 bit preference values followied by a host name willing to act as a mail exchange
  • If type is NS, then RDATA is a host name
  • If type is PTR, then RDATA is a domain name
  • If type is SOA, then RDATA is several fields
  • The structure of an RR is as follows:
  • start of line gives the owner of the RR, next we list the TTL, the type and the class of the RR (dig giving results in different order)
  • Domain names in RR’s could point at the primary name not another alias to avoid extra indirection
Queries
  • queries are carried in UDP or over TCP connection
  • responses to queries are either the answer to the question, refers the requester to another set of name server or signals some error condition
  • in general the resolver deals with queries on the users behalf
  • DNS queries and responses have a standard format, the start is a opcode which is one for a standard query
  • Then there are 4 section in a DNS query/response:
  • Question – Query name and other parameters
  • Answer – Carries RR’s which directly answer the query
  • Authority – Carries Rrs which describe other authoritative servers
  • Additional – Carries Rrs which may be helpful in using the Rrs in the other sections
  • A standard query specifies a target domain name (QNAME), query type (QTYPE) and query class (QCLASS)
  • QTYPE field may specify a specific type, AXFR special zone transfer QTYPE, MAILB matches all mailbox related Rrs or * which (like the regular expression) matchs all RR types
  • QCLASS field may contain a specific type or * to match all RR classes

Getting Started with Python

Why have I chosen to learn Python ?

  • Its a language sometimes used by my summer research project
  • It works particularly better than other language on the Raspberry Pi
  • I’ve covered the basics of Python before but never gone further than simple syntax
  • Its open source
  • It can be used as a scripting language, which is an area that I really want to improve
  • It supports object-oriented programming so I feel that with my previous experience of Java and C++, I will have covered the most popular OOP languages

Python 2.7 or 3 ?

The first decision that I needed to make before learning Python was whether to learn Python 2.7 or 3, since Python 3 fixes some of the “non-optional” design decision found in 2.7 and it of course more update I decided to learn Python 3. The potential problems with this is the lack of backwards compatibility with 2.7 and lack of teaching materials.

Python in my degree ?

The lecture course which is most likely to have included Python is Concepts of Programming Language which I took in Easter term this year. The course makes a few references to the language but does not cover the language to the degree that languages such as Fortran, Lisp, Algol, Pascal, Simula, Smalltalk, SML and Scala were covered.

What programming paradigms does Python support ?

Python advertises itself as a multi-paradigm programming language and from what I can see the range of supported paradigms supports this claim. This makes Python a multi-purpose language along with languages such as Java, C, C++ and SML compared to special-purpose languages such as SQL and LaTeX. The paradigm which Python support include:

  • imperative programming – describing computation by prescribing how to move the program from one state to another, imperative languages included Fortran, C and assembly
  • object-oriented programming – it supports dynamic lookup, abstraction, subtyping and inheritance, OOP language include Java, C++, Simula, Smalltalk and Scala
  • Functional programming – treats computation as the evaluation of mathematical function and avoids state and mutable data. Examples include Lisp, Ocaml, Haskell and Scala
  • Reflection programming – allowing a program to examine and modify its structure at runtime. Like Java, ML and Haskell
  • Structured programming – subroutines, blocks, for and while making code more readable compared to simple tests and goto’s. Examples include most modern languages except assembly code

 How does Python handle types?

Type checking is the process of attempting to prevent type errors by ensuring that the operations in a programm are applied properly. There are two common types of type checking:

  • Dynamic (run-time) type checking – the compiler generates code such that whenever an operation is performed when the program is ran, the code checks to make sure that the operands have the correct types. Examples of dynamically type checked languages are MATLAB, Prolog and Ruby
  • Static (compile-time) type checking – the compilers checks that program text for potential type errors as the program is compiled. Examples of statically type checked languages are  Fortran, Haskell, Java and Ocaml
Python uses Duck typing

Python is dynamically type checked, in fact it uses duck typing. Duck typing means that valid semantics of a object are dictated by the current state of methods instead of its inheritance from a class/interface like in Java. Why call it duck typing then ? the name refers to the duck test:

“When I see a bird that walks like a duck and swims like a duck and quacks like a duck, I call that bird a duck”

This allows use of EAFP or “It’s easier to Ask Forgiveness than Permission”. attributed to Grace Hopper.

A type system is said to be strongly typed when it specifies restrictions on how operations involving values of different data types can be intermixed. If this is not true, then we describe the language as weak. Python is strongly typed.

The built in types are as follows:

  • bool – truth value e.g. True or False
  • int – int of unlimited magnitude
  • list – mutable sequence (supports mixed types) e.g. [4, 4.0, “Four”]
  • tuple – immutable sequence (supports mixed types) e.g. ( 4, 4.0, “Four”)
  • complex – complex number e.g. 3+5j
  • float – floating point number e.g. 3.14
  • dict – group of key and value pairs {“key1”:1.0, “key2”:2.0}
  • set – unordered, no-duplicate collection of values
  • bytes – immutable sequence of bytes
  • bytearray – mutable sequence of bytes
  • str – sequence of Unicode characters

Mutable means that an objects state can be modified after its created, immutable means the opposite.

Python uses the off-side rule

Syntax & Semantics of Python ?

Unlike almost all other languages that I have studied Python uses whitespace to delimit block instead of curly braces (like Java), this feature is termed the off-side rule.

You define a function or method using the def keyword. The typically statements that you would except also apply: if, for, while, try, except and finally.

Python uses the words and, or, not for boolean expressions. Integer division (using //) is defined to round towards minus infinity and floating point division is do using /. Compare by value is achieved using == and compare by reference is achieved using is.

The syntax and semantics of Python highlight the main purpose of Python as a language for teaching programming, hence the focus on readability. This is why Python is the language being taught to school children using the Raspberry Pi

Installing the Interpreter

I install the packages for Python 3 from the Ubuntu Repositories using  sudo apt-get install python3-minimal, and then run the interpreter by entering Python3 into the terminal

Once I get going with Python programming I hope to return to this type of language analysis and review how to design decisions made in the development of Python 3 reflect the use of the language in an introduction to programming. I hope to put up a part 2 of Getting Started with Python, later this week.

As ever, feel free to comment and highlight my mistakes even the spelling/grammar ones.

I’m back from brussels

From this …

I arrived in the back in Cambridge yesterday afternoon after the IBM EMEA Best Student Recognition Event 2012 in Brussels, Belgium. The theme of this years 3 day event was “Data Analytics”. The highlight for me on Wednesday 4th July was the a semenar by Prof. Bart Goethals, a lecturer at University Antwerp on data mining and machine learning in data analytics. For Thursday 5th July my highlights were a talk by Corinna Schulze from the IBM Governmental Program on privacy and a talk by Robert-Jan Sips focusing on how data analytics is being applied to water management in the Netherlands.  And for the final day was highlights was a talk by Guy Lefever on the application of Watson to future medical care. Overall I really enjoyed the event and I would throughly recommend attending if you get the opportunity to go.

It looks as if I’m going to have some new work on Signposts next week but I don’t know as yet what this will be so over the weekend I intend to review some of the new material that I’ve learned over the my first few weeks on the Undergraduate Research Opportunities Project (UROP) here at the William Gates Building, Cambridge. My initial project brief was:

Signposts – the virtual personal internet 

Users with multiple networked devices are having a hard time linking them up when said devices reside at more than one location. Simple example is trying to synch your android phone while on the bus to the laptop in your office and the Mac Mini at home. NATs and Firewalls and other obstacles to symmetric communication abound, and obstruct. Point solutions (Dropbox, iCloud) do not deal with many cases (different OS) and only work if you pay (with money or eyeball time on adverts). In any case, they are not acceptable to privacy conscious users.

Instead, we are building a system called signposts. This is the smallest conceivable virtual machine that sits in the public internet, and looks after your personal namespace of devices. Registration and lookup of names is “effectful”, in other words, has side effects. The side effect is to use a set of possible tactics to rendezvous between the devices (various techniques for NAT traversal, for example) including possible multiple simultaneous path use, and also allowing for constraints, such as battery life, and price of use of connections. The system sits on top of DNSSEC, but more interestingly, as a piece of Computer Science, makes use of the new Cloud OS software being developed in CL, called Mirage (see http://www.openmirage.org/ and https://github.com/avsm/signpostd

This project is to help with the documentation and testing “in the wild” of signposts. We expect to uncover a lot of problems with different oddities (see http://conferences.sigcomm.org/imc/2011/docs/p181.pdf for examples) that appear, and we need to have good stories to account for what we are doing to the public and to security people. OCaml programming is pretty essential, as well as at least basic understanding of development tools (including git etc) “

Over the last few weeks, I had used a wide range of tools that are new or that I had not previously used to this extent including (in no particular order):

  • Unix command line – although this is something that i typically use, I have definitely used it more and for a wider range of applications that ever before
  • Git & GitHub – for my 2nd year group project PROTON, we took the (controversial) decision to use Mercurial and BitBucket instead of Git and GitHub like most other group so it been interesting to see how they compare
  • Tor – As alot of my work over the last few weeks has been looking at testing the performance of tools for online anonymity, I’ve made extensively use of the onion router
  • Orbot – Tor for android
  • Tails – An OS build around tor
  • Iperf – As just explained as I’ve been working with testing network performance Iperf has been an invaluable tool (though I have recently discovered Torperf)
  • Amazon Elastic Compute Cloud – I’ve been using this a server when testing network performance, its really use as it provides as global domain name and IP address and avoid pathological cases that could be created by having other cleint and server on the same network
  • SHH – for remote access to the other computers I am testing with locally and the virtual machine on Amazon EC2
  • dig-host – command line tools for performing DNS queries, just generally useful
  • CyanogenMod – the OS on my android phone which allows easy using of proxies and tethering
  • Privoxy – heavy customizable non-caching web proxy
  • Hamachi – used with privoxy so set up encrypted VPN access
  • OpenVPN – easy set up of VPNs
  • Texmaker – a nice LaTeX editor, that I would throughly recommend which I have been using to keep the notes on my work over the last few weeks
  • Mozilla Firefox & addons – previously a Chrome user myself, I’ve opted to use Firefox since the research project begin for a change and during this time I have experimented with a range of addons for privacy including: HTTPS Everywhere, NoScript,  AdBlock Plus, DoNotTrackPlus, Collusion …
  • Eclipse & Android Development tools – getting to grips with the basics of Android development in preparation for the Google European Android Camp
  • Telnet – for commend line control of Tor
  • Wireshark – very useful for learning about computer networking, allows you to watch the packets on the network
  • TraceRoute – brillient tools which used TTL to should the route that a packet takes to its destination
  • Ocaml – the main project that I’m working with over the summer makes extensive use of Ocaml

I plan to use this list to visit some of the material that I’ve learned over the last few weeks in preparation for new work next week.