We are living in an era where software is at the centre of everything. Software systems even influence elections, as recently seen in Pakistan.
According to the news, the Results Transmission System (RTS) was developed by Nadra and used by the ECP. The RTS consisted of a mobile application used by presiding officers to submit election results to the RTS servers. Those servers then processed and stored data for ECP use.
The ECP claims that the system collapsed; Nadra says it did not. The public rift is reminiscent of Conway’s law: organisations which design systems are constrained to produce designs which are copies of their own communication structures. The RTS controversy, and the way the ECP and Nadra have communicated about it, fits this law; organisations with poor communication structures are unlikely to produce reliable software systems.
Did the RTS actually fail? A committee has been formed to investigate. It should have included people from academia and the software industry. It should not only tell us whether the RTS failed on July 25 but also provide a public technical audit of all phases of the RTS development lifecycle (design, implementation, testing and deployment).
A number of questions need to be answered.
A detailed analysis can help us find out what happened, and also determine what not to do in future for such software applications. After all, such controversies leave everyone in doubt about the outcome of the polls for which our institutions had ample time and billions of rupees to prepare.
To evaluate the design phase, many things need to be carefully considered. For mission-critical systems like the RTS, the inquiry report must evaluate the correctness and completeness of the software requirements upon which the implementation is based. Miscommunication about software requirements typically leads to a poor implementation. Conway’s law and the conflicting stances of the institutions involved indicate that this must be carefully evaluated.
Moreover, software systems with extreme reach and political impact need to be built with scalability and fault tolerance in mind. Software like the RTS must be able to operate under a heavy load. Data storage design, backups and security should also be considered at the design stage. In this phase, the role of experienced software craftsmen becomes important. A thorough audit of the design phase matters because addressing these things at the design stage is far cheaper than paying later for costly fixes, damaged reputations and lost confidence in such systems.
To review the implementation phase, among other things, the composition of the RTS development team and their practices should be evaluated. Did the team consist of competent engineers with different levels of experience? What programming languages and software tools were used by the team and why?
The inquiry report should also evaluate whether modern software development practices, such as extensive automated testing, were followed.
It is also important to record detailed statistics about the health of the system, and to send alerts to other relevant systems and even to people (via email or SMS) when the service is not performing as expected. Did the RTS implement such a strategy?
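The kind of health check described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `HealthSample` fields, the thresholds and the `notify` callback are hypothetical and say nothing about how the RTS was actually built. In a real deployment, `notify` would hand the message to an email or SMS gateway.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class HealthSample:
    """One measurement window for a service (fields are illustrative)."""
    requests: int
    errors: int
    avg_latency_ms: float


def check_health(sample: HealthSample,
                 max_error_rate: float = 0.05,
                 max_latency_ms: float = 2000.0) -> List[str]:
    """Return a list of problems found in the sample; empty means healthy."""
    problems = []
    if sample.requests > 0:
        error_rate = sample.errors / sample.requests
        if error_rate > max_error_rate:
            problems.append(
                f"error rate {error_rate:.0%} exceeds {max_error_rate:.0%}")
    if sample.avg_latency_ms > max_latency_ms:
        problems.append(
            f"average latency {sample.avg_latency_ms:.0f} ms "
            f"exceeds {max_latency_ms:.0f} ms")
    return problems


def monitor(sample: HealthSample, notify: Callable[[str], None]) -> int:
    """Send one alert per problem and return how many alerts were sent."""
    problems = check_health(sample)
    for problem in problems:
        notify(problem)  # assumed hook: email, SMS or another system
    return len(problems)
```

The point of keeping `notify` separate is that the same check can alert an on-call engineer, a dashboard and an audit log at once, which also leaves a record of who was told what and when.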
During the testing and deployment phase, software code should be validated with proper stress testing. It remains to be seen whether or not the RTS was thoroughly tested and deployed in an environment which guarded against security threats and still performed well under a heavy load.
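To make the idea of stress testing concrete, here is a minimal sketch. `ResultStore` is a hypothetical in-memory stand-in for a results server, not the real RTS, and the test runs inside one process; a real stress test would send traffic over the network to a deployed copy of the system. The sketch submits every polling station's result twice from many threads at once and checks that exactly one copy of each is kept.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ResultStore:
    """Hypothetical stand-in for a results server (not the real RTS)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._results = {}

    def submit(self, station_id: int, votes: int) -> bool:
        """Store a result once; reject any duplicate for the same station."""
        with self._lock:
            if station_id in self._results:
                return False
            self._results[station_id] = votes
            return True

    def count(self) -> int:
        with self._lock:
            return len(self._results)


def stress_test(store: ResultStore, stations: int, workers: int) -> dict:
    """Submit each station's result twice, concurrently, and summarise."""
    jobs = [(station, 100) for station in range(stations)] * 2
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outcomes = list(pool.map(lambda job: store.submit(*job), jobs))
    return {
        "accepted": outcomes.count(True),
        "rejected": outcomes.count(False),
        "stored": store.count(),
    }
```

Calling `stress_test(ResultStore(), stations=500, workers=32)` should report 500 accepted, 500 rejected and 500 stored. Any other numbers would mean results were lost or duplicated under load.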
It may be easier to determine if the RTS really failed on July 25 when all these questions are answered. At the moment, we do not have insight into the uptime of servers, state of the RTS database, bug reports, and unaltered logs of the servers. Were the statistics about the health of underlying systems being monitored? If systems stopped performing, were alarms raised on time? Who was supposed to receive those alerts and did they receive them? What do the logs say? It would be interesting to analyse the actual software complaints that the ECP received on the evening of July 25.
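One way to make server logs verifiably unaltered is to chain them with hashes, so that each entry's fingerprint depends on every entry before it. The sketch below is an assumption offered for illustration, not a description of how the RTS logged anything; the function names and the `GENESIS` starting value are hypothetical.

```python
import hashlib

# Assumed starting value for the chain; any fixed, published string works.
GENESIS = "0" * 64


def _digest(prev_hash: str, message: str) -> str:
    """Fingerprint a message together with the previous entry's hash."""
    data = (prev_hash + "\n" + message).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def append_entry(log: list, message: str) -> None:
    """Append (message, hash) where the hash covers all earlier entries."""
    prev_hash = log[-1][1] if log else GENESIS
    log.append((message, _digest(prev_hash, message)))


def verify(log: list) -> bool:
    """Recompute the chain; any edited or removed entry breaks it."""
    prev_hash = GENESIS
    for message, entry_hash in log:
        if _digest(prev_hash, message) != entry_hash:
            return False
        prev_hash = entry_hash
    return True
```

If the latest hash is shared with an independent party at regular intervals, an auditor can later confirm that the logs handed over are the ones that were written on the night.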
When software systems of such scale are deployed, institutions also must consider worst-case scenarios and standard operating procedures. What were the RTS SOPs, and specifically what was the SOP in case of a complete RTS collapse? Was the ECP following SOPs by announcing the collapse of the RTS on its own?
Lastly, communication within and among our public institutions is dismal. When information does not move freely within organisations, it becomes even more important to create tools that help identify problems. Better automation and modern software development practices can help in this regard. Hopefully, the (public) report of this committee will help us avoid repeating the same mistakes after another five years and billions more rupees.
The writer is a freelance contributor based in Lahore.
Published in Dawn, August 16th, 2018