
Expresiones Regulares - Validación de entradas

Objetivos

Practicar la implementación de expresiones regulares para sanitizar los datos de entrada.

Pre-Modulo:

Antes de trabajar en este módulo el estudiante debe

Haber repasado la construcción de expresiones regulares.
Haber repasado la implementación de expresiones regulares en python.

Validación de entradas usando expresiones regulares

La validación de entradas es escencial en el desarrollo de aplicaciones seguras ya que es la primera línea de defensa de todas las aplicaciones. La gran mayoría de las vulnerabilidades en aplicaciones es debido a pobre validación de las entradas de las aplicaciones. Ejemplo de vulnerabilidades explotadas por pobre validación de entradas lo son:

Desbordamiento del Búfer
Cross Site Scripting
Inyección de SQL, XML u otros lenguajes
Recorrido de directorios
Acceso a archivos privados

Los atacantes utilizan errores en la validación de entradas para acceder a información privilegiada, para acceder a los sistemas, o para causar negación de servicios.

Utilizando expresiones regulares la aplicación puede validar y solo aceptar datos de entradas que sean aceptadas por la especificación de una expresión regular. Como por ejemplo se puede restringir a que la entrada consista solo de dígitos, o como vamos a hacer en este módulo, que las entradas sigan especificaciones estrictas de nombres de usuario, correos electrónicos, dominios y direcciones de IP versión 4.

Descargo de responsabilidad Las expresiones regulares para los usuarios, correos electrónicos, ni dominios que se van a construir en este módulo incluyen todas las especificaciones según la definicion en el RFC[2].

Instrucciones generales

En este módulo utilizaremos la librería re (regular expression) para a través de definiciones de expresiones regulares validar cadenas de entradas almacenadas en archivos.

En general por cada ejercicio el estudiantes va a definir una expresion regular para validar una cadena de caracteres de entrada y una aplicación de python va a validar las cadenas y desplegar si la cadena es valida según la expresion regular provista.

La expresión regular se va a definir en la variable REGEXP que se encuentra en las primeras líneas de código del script regex-tester.py.

Luego de definir la expresión regular el estudiante debe correr el script con el archivo de entrada del ejercicio. Por ejemplo:

python regex-tester.py emails.txt

ejecuta el script regex-tester.py con el archivo de entrada emails.txt

Note que toda expresión regular que el estudiante defina, va a ser finalmente entregada al instructor, así que anote todos las expresiones regulares usadas en los ejercicios.

Ejercicio 1: Expresiones regulares para nombres de usuarios.

En el archivo regex-tester.py cambie el contenido de la variable REGEXP por la siguiente expresión regular: [a-zA-Z0-9]+

Como debe notar esta expresión regular reconoce cualquier cadena de caracteres compuesta de uno o mas (+) letras [a-zA-Z] o digitos [0-9].
Corre el script regex-tester.py con el archivo usernames.txt.

python regex-tester.py usernames.txt

Como resultado debe ver que todas las pruebas pasaron:

26 strings passed, 0 strings failed y que nombres de usuarios como:

23bolitas y 123pescao pasaron la prueba.
Modifique la expresión regular para que solo acepte nombres de usuarios que comiencen con una letra y no acepte usuarios como 23bolitas y 123pescao.
Corra el script regex-tester.py con el archivo usernames.txt.

python regex-tester.py usernames.txt

Como resultado debe ver que todas las pruebas pasaron menos 23bolitas y 123pescao:

24 strings passed, 2 strings failed
Corra el script regex-tester.py con el archivo usernames_swletter.txt

python regex-tester.py usernames_swletter.txt

Como resultado debe ver que todas las pruebas pasaron:

26 strings passed, 0 strings failed
Corra el script regex-tester.py con el archivo usernames_swletter_dotted.txt

python regex-tester.py usernames_swletter_dotted.txt

Como resultado debe ver:

10 strings passed, 20 strings failed

debido a que 16 de los nombres de usuarios en ese archivo estan compuestos de nombres de usuarios con puntos entre medio o debido a que los nombres comienzanan con digitos o puntos.
Modifique la expresión regular para que solo acepte nombres de usuarios que tengan puntos entre medio y cuyos nombres a los lados de los puntos comienzen con una letra. Ejemplos: nombre.apellido, nombre1.apellido1, nombre1.apellido1.apellido2, …
Corra el script regex-tester.py con el archivo usernames_swletter_dotted.txt

python regex-tester.py usernames_swletter_dotted.txt

Como resultado debe ver que todas las pruebas pasaron menos 4:

26 strings passed, 4 strings failed

debido que las ultimos 4 líneas tienen nombres de usuario que o tienen puntos al princio o al final de nombre de usuario, o que los nombres a los lados de los puntos comienzan con digitos.

Note que debe tener apuntado las expresiones regulares de los pasos 3 y 7.

Ejercicio 2: Expresiones regulares para nombres de dominios.

En el archivo regex-tester.py cambie el contenido de la variable REGEXP por la siguiente expresión regular: [a-zA-Z0-9]+.(com|edu|net|org)

Como debe notar esta expresión regular reconoce cualquier cadena de caracteres compuesta de uno o mas (+) letras [a-zA-Z] o digitos [0-9] seguidos por un punto y uno de los siguientes dominios de nivel superior com, edu, net, org.
Corre el script regex-tester.py con el archivo domains.txt.

python regex-tester.py domains.txt

Como resultado debe ver que 10 pruebas pasaron y 10 fallaron:

10 strings passed, 10 strings failed.

Esta expresión regular solo acepta nombres de dominios de que consisten de un nivel superior com, edu, net, org y un segundo nivel de dominio como gmail.com, facebook.com, y twitter.com, pero no dominios con mas niveles como www.facebook.com y mail.gmail.com.
Modifique la expresión regular para que acepte dominios con más de dos niveles incluyendo los dominio de nivel superior com, edu, net, org.
Corre el script regex-tester.py con el archivo domains.txt.

python regex-tester.py domains.txt

Como resultado debe ver que 16 pruebas pasaron y 4 fallaron:

16 strings passed, 4 strings failed.

Esto debido a que hay dominios de nivel superior mil (airforce.mil), gov (whitehouse.gov), y dominios de nivel superio de códigos de país (www.isla.com.pr), y un dominio de nivel superior invalido pur (www.isla.net.pur).
Modifique la expresión regular para que acepte dominios con más de dos niveles, pero en adición incluya los dominios gov, mil, biz, y cualquier dominio de nivel superior de códigos de país. Los dominios de nivel superior de códigos de país se componen de dos letras al final del dominio. Por ejemplo us (United States), pr (Puerto Rico), ca (Canada), ar (Argentina). En otras palabras el dominio de nivel superior puede ser com, edu, net, org, gov, mil, biz, o cualquier combinación de dos letras (solo dos letras).
Corre el script regex-tester.py con el archivo domains.txt.

python regex-tester.py domains.txt

Como resultado debe ver que todas las pruebas pasaron menos la del dominio invalido pur.

19 strings passed, 1 strings failed.

Note que debe tener apuntado las expresiones regulares de los pasos 3 y 5.

Ejercicio 3: Expresiones regulares para correos electrónicos

En el archivo regex-tester.py cambie el contenido de la variable REGEXP por la siguiente expresión regular: [a-zA-Z0-9]+\@gmail.com

Como debe notar esta expresión regular reconoce cualquier cadena de caracteres compuestas de nombres de usuarios de uno o mas (+) letras [a-zA-Z] o digitos [0-9] seguido de @ y el dominio gmail.com.
Corre el script regex-tester.py con el archivo gmail_emails.txt.

python regex-tester.py gmail_emails.txt

Como resultado debe ver que 5 pruebas pasaron y 14 fallaron:

5 strings passed, 14 strings failed.

Esto debido a que a pesar que todos los correos electrónicos son de gmail, la expresion regular para los nombres de usuarios no permite nombres de usuarios que contienen punto entre medio.
Modifique la expresión regular para que como en el Ejercicio 1.7 los usuarios de los correos electrónicos puedan contener puntos entre medio de nombres que comienzan con una letra.
Corre el script regex-tester.py con el archivo gmail_emails.txt.

python regex-tester.py gmail_emails.txt

Como resultado debe ver que todas las pruebas pasaron:

19 strings passed, 0 strings failed.
Modifique la expresión regular para que como en el Ejericios 2.5 los correos electrónicos puedan tener dominios con más de dos niveles y el dominio de nivel superior pueda ser com, edu, net, org, gov, mil, biz, o cualquier combinación de dos letras (solo dos letras).
Corre el script regex-tester.py con el archivo emails.txt

python regex-tester.py emails.txt

Como resultado debe ver que todas las pruebas pasaron:

19 strings passed, 0 strings failed.

Note que debe tener apuntado las expresiones regulares de los pasos 3 y 5.

Ejercicio 4: Expresiones regulares para direcciones IP versión 4.

Una dirección de IP version 4 consiste de 4 bytes en representación decimal donde los 4 bytes estan separados por un punto. Por ejemplo: 10.12.20.5, consiste de 4 números decimales 10, 12, 20 y 5 separados por un punto. Los números decimales van desde 0 hasta 255 ya que cada uno representa un byte. Por lo tanto la siguiente cadena de caracteres no es una dirección de IP 10.244.260.21 por que tiene el número decimal 260 que es mayor que 255.

En el archivo regex-tester.py cambie el contenido de la variable REGEXP por la siguiente expresión regular: (\d+.)+\d+

Como debe notar esta expresión regular reconoce cualquier cadena de caracteres compuestas de uno o mas digitos divididos por puntos y terminando con uno o más digitos.
Corre el script regex-tester.py con el archivo ip_4like.txt.

python regex-tester.py ip_4like.txt

Como resultado debe ver que todas las pruebas pasaron:

26 strings passed, 0 strings failed

y que IPs como 10.0.1 y 10.1.1.1.1 son aceptados aún cuando a uno le falta un decimal y al otro le sobran.
Modifique la expresión regular para que acepte solo IPs cuatro números decimales divididos por un punto.
Corre el script regex-tester.py con el archivo ip_4like.txt. python regex-tester.py ip_4like.txt

Como resultado debe ver que todas las pruebas pasaron menos dos:

24 strings passed, 2 strings failed

y que los IPs 10.0.1 y 10.1.1.1.1 ya no son aceptados. Pero todavía IPs de cuatro números decimales como 256.10.11.1, 10.0.0.300, 10.1000.10.1 son aceptados aún cuando tienen números decimales mayores que 255.

Note que utilizando la regla \d{1,3} podemos restringir a números de 1 a 3 digitos. Pero esto aún permitiría los IPs 256.10.11.1, 10.0.0.300.
Modifique la expresión regular para que solo acepte 4 números decimales de 1 a 3 dígitos separados por puntos.
Corre el script regex-tester.py con el archivo ip_4like.txt.

python regex-tester.py ip_4like.txt

Como resultado debe ver que 23 pruebas pasaron y 3 fallaron:

23 strings passed, 3 strings failed debido a que ahora tampoco acepta el IP 10.1000.10.1.
Como último desafío, modifique la expresión regular para que solo acepte IP versión 4 validos. Esto es de 4 bytes en representación decimal divididos por puntos.
Corre el script regex-tester.py con el archivo ip_4like.txt.

python regex-tester.py ip_4like.txt

Como resultado debe ver que 21 pruebas pasaron y 5 fallaron:

21 strings passed, 5 strings failed

Note que debe tener apuntado las expresiones regulares de los pasos 3, 5 y

Entregas

Entregue las expresiones regulares construidas en los ejercicios: 1.3, 1.7, 2.3, 2.5, 3.3, 3.5, 4.3, 4.5, 4.7 al instructor.

Referencias

[1] Python Regular Expression Library, https://docs.python.org/2/library/re.html [2] Internet Message Format, Address Specification, http://tools.ietf.org/html/rfc5322#section-3.4

English | Español

Regular Expressions - Input Validation

Objectives

Practice the implementation of regular expressions to sanitize input data.

Pre-Module:

Before working on this module, the student must:

Have reviewed the construction of regular expressions.
Have reviewed the implementation of regular expressions in Python.

Input validation using regular expressions

Input validation is essential in the development of secure applications because it is the first line of defense of every application. The vast majority of application vulnerabilities are due to poor input validation. Examples of vulnerabilities exploited through poor input validation are:

Buffer overflows
Cross Site Scripting
SQL, XML, or other language injection
Directory Traversal
Access to private files

Attackers use errors in input validation to gain access to privileged information, to gain access to systems, or to cause denial of service.

Using regular expressions, the application can validate input and accept only data that matches the specification given by a regular expression. For instance, we can restrict the input to consist of digits only or, as we will do in this module, require that the input follow strict specifications for usernames, emails, domains, and IP version 4 addresses.
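As a minimal sketch of the digits-only restriction mentioned above, the following uses `re.fullmatch` from Python's `re` library so that the whole input, not just part of it, must match. The function name `is_digits_only` and the sample inputs are illustrative assumptions, not part of the module's files.

```python
import re

# Illustrative pattern: one or more decimal digits and nothing else.
# re.ASCII keeps \d limited to 0-9 instead of all Unicode digits.
DIGITS_ONLY = re.compile(r"\d+", re.ASCII)


def is_digits_only(text):
    """Return True only if the entire string consists of the digits 0-9."""
    return DIGITS_ONLY.fullmatch(text) is not None


if __name__ == "__main__":
    # Hypothetical inputs: only the first one is made of digits alone.
    for sample in ["12345", "12a45", "", "42; DROP TABLE users"]:
        print(repr(sample), is_digits_only(sample))
```

Because the pattern must cover the entire string, an input that merely starts with digits is rejected rather than partially accepted.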

Disclaimer: The regular expressions for usernames, emails, and domains that will be constructed in this module do not include all the specifications as defined in the RFC [2].

General Instructions

In this module we will use the re (regular expression) library to validate, by means of regular expressions, input strings stored in files.

In general, for each exercise the student will define a regular expression to validate input strings, and a Python application will validate the strings and display whether each string is valid according to the given regular expression.

The regular expression is defined in the variable REGEXP, which is found in the first lines of the script regex-tester.py.

After defining the regular expression the student will run the script with the input file of the exercise. For example:

python regex-tester.py emails.txt

executes the script regex-tester.py with the input file emails.txt
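The source of regex-tester.py is not reproduced here. The following is only a hypothetical sketch of how a tester of this kind might work, assuming it takes one string per line and requires the whole line to match REGEXP. The function name `check_lines` and the sample strings are assumptions; the summary line imitates the output shown in the exercises.

```python
import re

# Hypothetical stand-in for the REGEXP variable edited in the exercises.
REGEXP = r"[a-zA-Z0-9]+"


def check_lines(lines, regexp=REGEXP):
    """Return (passed, failed) lists, requiring each whole line to match."""
    pattern = re.compile(regexp)
    passed, failed = [], []
    for line in lines:
        candidate = line.strip()
        if not candidate:
            continue  # skip blank lines
        if pattern.fullmatch(candidate):
            passed.append(candidate)
        else:
            failed.append(candidate)
    return passed, failed


if __name__ == "__main__":
    # Illustrative strings; an open text file could be passed instead,
    # since iterating over a file also yields one line at a time.
    samples = ["maria\n", "23bolitas\n", "name.last\n"]
    passed, failed = check_lines(samples)
    for candidate in failed:
        print("FAILED:", candidate)
    print(f"{len(passed)} strings passed, {len(failed)} strings failed")
```

Keeping the matching logic in a function that accepts any iterable of lines makes it easy to try a pattern on a few strings before running it against a whole input file.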

Note that every regular expression the student defines will ultimately be submitted to the instructor, so take note of all the regular expressions used in the exercises.

Exercise 1: Regular expressions for usernames

1. In the file regex-tester.py, change the content of the variable REGEXP to the following regular expression: [a-zA-Z0-9]+

As you can see, this regular expression recognizes any string composed of one or more (+) letters [a-zA-Z] or digits [0-9].

2. Run the script regex-tester.py with the file usernames.txt.

python regex-tester.py usernames.txt

As a result you should see that all the tests passed:

26 strings passed, 0 strings failed

and that usernames like 23bolitas and 123pescao passed the test.

3. Modify the regular expression so that it only accepts usernames that begin with a letter and does not accept usernames like 23bolitas and 123pescao.

4. Run the script regex-tester.py with the file usernames.txt.

python regex-tester.py usernames.txt

As a result you should see that all the tests passed except 23bolitas and 123pescao:

24 strings passed, 2 strings failed

5. Run the script regex-tester.py with the file usernames_swletter.txt.

python regex-tester.py usernames_swletter.txt

As a result you should see that all the tests passed:

26 strings passed, 0 strings failed

6. Run the script regex-tester.py with the file usernames_swletter_dotted.txt.

python regex-tester.py usernames_swletter_dotted.txt

As a result you should see:

10 strings passed, 20 strings failed

This is because 16 of the usernames in that file consist of names with dots in between, and others begin with digits or dots.

7. Modify the regular expression so that it also accepts usernames with dots in between, where each name on either side of a dot begins with a letter. Examples: name.last, name1.last1, name1.last1.last2, …

8. Run the script regex-tester.py with the file usernames_swletter_dotted.txt.

python regex-tester.py usernames_swletter_dotted.txt

As a result you should see that all the tests passed except 4:

26 strings passed, 4 strings failed

because the last 4 lines have usernames that either have a dot at the beginning or the end of the username, or have names on the sides of the dots that begin with digits.

Note that you should have recorded the regular expressions from steps 3 and 7.
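When validating input, it matters whether the pattern must match the whole string or only part of it. The sketch below contrasts `re.search`, `re.match`, and `re.fullmatch` from Python's `re` library using the starting pattern of this exercise. It assumes, without having seen the script, that regex-tester.py requires the whole string to match, which is consistent with the results described in this exercise. The sample strings are illustrative.

```python
import re

PATTERN = r"[a-zA-Z0-9]+"  # starting pattern from step 1

for candidate in ["maria", "23bolitas", "name.last", ".name"]:
    # search: the pattern may match anywhere inside the string
    found_anywhere = re.search(PATTERN, candidate) is not None
    # match: the pattern must match at the start, but may stop early
    found_at_start = re.match(PATTERN, candidate) is not None
    # fullmatch: the pattern must cover the entire string
    whole_string = re.fullmatch(PATTERN, candidate) is not None
    print(candidate, found_anywhere, found_at_start, whole_string)
```

For validation, whole-string matching is usually the safer choice, because a partial match would let through strings that merely contain a valid-looking fragment.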

Exercise 2: Regular expression for domain names.

1. In the file regex-tester.py, change the content of the variable REGEXP to the following regular expression: [a-zA-Z0-9]+\.(com|edu|net|org)

As you can see, this regular expression recognizes any string composed of one or more (+) letters [a-zA-Z] or digits [0-9] followed by a dot and one of the following top-level domains: com, edu, net, org.

2. Run the script regex-tester.py with the file domains.txt.

python regex-tester.py domains.txt

As a result you should see that 10 tests passed and 10 failed:

10 strings passed, 10 strings failed.

This regular expression only accepts domain names that consist of a top-level domain (com, edu, net, org) and a second-level domain, like gmail.com, facebook.com, and twitter.com, but not domains with more levels such as www.facebook.com and mail.gmail.com.

3. Modify the regular expression to accept domains with more than two levels, still ending in one of the top-level domains com, edu, net, org.

4. Run the script regex-tester.py with the file domains.txt.

python regex-tester.py domains.txt

As a result you should see that 16 tests passed and 4 failed:

16 strings passed, 4 strings failed.

This is because there are domains with the top-level domains mil (airforce.mil) and gov (whitehouse.gov), a country code top-level domain (www.isla.com.pr), and an invalid top-level domain pur (www.isla.net.pur).

5. Modify the regular expression to accept domains with more than two levels and, in addition, the top-level domains gov, mil, biz, and any country code top-level domain. Country code top-level domains are composed of two letters at the end of the domain, for example us (United States), pr (Puerto Rico), ca (Canada), and ar (Argentina). In other words, the top-level domain can be com, edu, net, org, gov, mil, biz, or any combination of two letters (only two letters).

6. Run the script regex-tester.py with the file domains.txt.

python regex-tester.py domains.txt

As a result you should see that all the tests passed except the one with the invalid domain pur:

19 strings passed, 1 strings failed.

Note that you should have recorded the regular expressions from steps 3 and 5.
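One detail worth checking in domain patterns is the dot. In a regular expression an unescaped `.` matches any character, so a literal dot is written `\.` instead. The sketch below compares the two spellings of this exercise's starting pattern; the names `UNESCAPED` and `ESCAPED` and the sample strings are made up for illustration.

```python
import re

# An unescaped dot matches any single character.
UNESCAPED = re.compile(r"[a-zA-Z0-9]+.(com|edu|net|org)")
# An escaped dot matches only a literal ".".
ESCAPED = re.compile(r"[a-zA-Z0-9]+\.(com|edu|net|org)")

for candidate in ["gmail.com", "gmail-com", "gmailcom", "www.facebook.com"]:
    loose = UNESCAPED.fullmatch(candidate) is not None
    strict = ESCAPED.fullmatch(candidate) is not None
    print(candidate, loose, strict)
```

Both spellings accept gmail.com and reject www.facebook.com, but only the unescaped one accepts strings such as gmail-com, which is why the difference is easy to miss when the test file contains no such cases.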

Exercise 3: Regular expressions for emails

1. In the file regex-tester.py, change the content of the variable REGEXP to the following regular expression: [a-zA-Z0-9]+\@gmail\.com

As you can see, this regular expression recognizes any string composed of a username of one or more (+) letters [a-zA-Z] or digits [0-9], followed by an @ and the domain gmail.com.

2. Run the script regex-tester.py with the file gmail_emails.txt.

python regex-tester.py gmail_emails.txt

As a result you should see that 5 tests passed and 14 failed:

5 strings passed, 14 strings failed.

This is because, even though all the emails are from gmail, the regular expression for the usernames does not permit usernames that contain dots in between.

3. Modify the regular expression so that, as in Exercise 1.7, the email usernames can contain dots in between names that begin with a letter.

4. Run the script regex-tester.py with the file gmail_emails.txt.

python regex-tester.py gmail_emails.txt

As a result you should see that all the tests passed:

19 strings passed, 0 strings failed.

5. Modify the regular expression so that, as in Exercise 2.5, the emails can have domains with more than two levels and the top-level domain can be com, edu, net, org, gov, mil, biz, or any combination of two letters (only two letters).

6. Run the script regex-tester.py with the file emails.txt.

python regex-tester.py emails.txt

As a result you should see that all the tests passed:

19 strings passed, 0 strings failed.

Note that you should have recorded the regular expressions from steps 3 and 5.
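Because an email pattern is a username pattern, an @, and a domain pattern, it can help to build it from named pieces. The sketch below does this using only the starting patterns of Exercises 1 and 2 (with the dot escaped), not the solutions. The variable names `USERNAME`, `DOMAIN`, and `EMAIL` and the sample addresses are illustrative assumptions.

```python
import re

# Starting patterns from Exercises 1 and 2, kept as separate pieces.
USERNAME = r"[a-zA-Z0-9]+"
DOMAIN = r"[a-zA-Z0-9]+\.(?:com|edu|net|org)"

# The email pattern is the concatenation: username, "@", domain.
EMAIL = re.compile(USERNAME + "@" + DOMAIN)

for candidate in ["maria@gmail.com", "name.last@gmail.com",
                  "maria@mail.gmail.com", "maria@gmail"]:
    print(candidate, EMAIL.fullmatch(candidate) is not None)
```

With this layout, improving the username or domain piece separately carries over to the email pattern without rewriting it.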

Exercise 4: Regular expression for IP version 4 addresses.

An IP version 4 address consists of 4 bytes in decimal representation, separated by dots. For example, 10.12.20.5 consists of the 4 decimal numbers 10, 12, 20, and 5 separated by dots. The decimal numbers range from 0 to 255 since each of them represents a byte. Therefore the string 10.244.260.21 is not an IP address, because it contains the decimal number 260, which is greater than 255.
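The byte-range rule above can also be checked without a regular expression, which gives an independent way to confirm which strings are valid addresses. The following is a minimal sketch; the function name `is_ipv4` is hypothetical, and limiting each part to 1 to 3 ASCII digits is a choice made for this illustration.

```python
def is_ipv4(text):
    """Return True if text is four decimal numbers in 0..255 separated by dots."""
    parts = text.split(".")
    if len(parts) != 4:
        return False
    for part in parts:
        # Each part must be 1 to 3 ASCII digits (no signs, spaces, or empty parts).
        if not (part.isascii() and part.isdigit()) or len(part) > 3:
            return False
        # A byte holds values from 0 to 255.
        if int(part) > 255:
            return False
    return True


if __name__ == "__main__":
    for sample in ["10.12.20.5", "10.244.260.21", "10.0.1", "10.1.1.1.1"]:
        print(sample, is_ipv4(sample))
```

A helper like this can be used to cross-check the results of a regular expression on the same strings.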

1. In the file regex-tester.py, change the content of the variable REGEXP to the following regular expression: (\d+\.)+\d+

As you can see, this regular expression recognizes any string composed of one or more groups of digits separated by dots and ending with one or more digits.

2. Run the script regex-tester.py with the file ip_4like.txt.

python regex-tester.py ip_4like.txt

As a result you should see that all the tests passed:

26 strings passed, 0 strings failed

and that IPs such as 10.0.1 and 10.1.1.1.1 are accepted even though one of them is missing a decimal number and the other has an extra one.

3. Modify the regular expression so that it accepts only IPs with four decimal numbers separated by dots.

4. Run the script regex-tester.py with the file ip_4like.txt.

python regex-tester.py ip_4like.txt

As a result you should see that all the tests passed except two:

24 strings passed, 2 strings failed

and that the IPs 10.0.1 and 10.1.1.1.1 are no longer accepted. However, IPs of four decimal numbers like 256.10.11.1, 10.0.0.300, and 10.1000.10.1 are still accepted even though they contain decimal numbers greater than 255.

Note that using the rule \d{1,3} we can restrict the numbers to 1 to 3 digits. But this will still permit the IPs 256.10.11.1 and 10.0.0.300.

5. Modify the regular expression to accept only 4 decimal numbers of 1 to 3 digits separated by dots.

6. Run the script regex-tester.py with the file ip_4like.txt.

python regex-tester.py ip_4like.txt

As a result you should see that 23 tests passed and 3 failed:

23 strings passed, 3 strings failed

because it now does not accept the IP 10.1000.10.1 either.

7. As a last challenge, modify the regular expression so that it only accepts valid IP version 4 addresses, that is, addresses of 4 bytes in decimal representation separated by dots.

8. Run the script regex-tester.py with the file ip_4like.txt.

python regex-tester.py ip_4like.txt

As a result you should see that 21 tests passed and 5 failed:

21 strings passed, 5 strings failed

Note that you should have recorded the regular expressions from steps 3, 5, and 7.
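To see why limiting the number of digits is not enough on its own, the sketch below applies the `\d{1,3}` rule mentioned in this exercise to single numbers. It only illustrates the quantifier and is not a solution to the last challenge; the name `ONE_TO_THREE_DIGITS` is an assumption, and the sample values come from the addresses discussed in this exercise.

```python
import re

# Matches any run of 1 to 3 digits, regardless of its numeric value.
ONE_TO_THREE_DIGITS = re.compile(r"\d{1,3}")

for number in ["5", "20", "255", "256", "300", "1000"]:
    accepted = ONE_TO_THREE_DIGITS.fullmatch(number) is not None
    fits_in_byte = int(number) <= 255
    print(number, accepted, fits_in_byte)
```

The quantifier rejects 1000 because it has four digits, but it still accepts 256 and 300, so the numeric range has to be expressed in the pattern some other way.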

Deliverables

Submit the regular expressions constructed in exercises: 1.3, 1.7, 2.3, 2.5, 3.3, 3.5, 4.3, 4.5, 4.7 to the instructor.

References:

[1] Python Regular Expression Library, https://docs.python.org/2/library/re.html

[2] Internet Message Format, Address Specification, http://tools.ietf.org/html/rfc5322#section-3.4
