
Poverty Lab



The MDGs global agenda has raised several questions about how to approach poverty analysis. New measurement methodologies, along with the data revolution, set out a new holistic approach to poverty analysis. The goal of eradicating poverty by 2030 will only be achievable if poverty is tackled in an integrated way.




This Poverty Lab aims to work as a technical clinic that analyses the practical nuts and bolts of poverty evaluation. It intends to go beyond theory and provide practical insights based on the hands-on, daily process of monitoring poverty. The idea is to discuss the specific procedures that support poverty monitoring, identifying tools and discussing practical methodologies, statistical techniques and survey analysis. Ultimately, the sum of all notes should produce a practical handbook on poverty analysis.




There are several tools that can be used to monitor or simply help with data analysis of poverty.

ADePT: Software Platform for Automated Economic analysis.

ADePT is software developed by the World Bank (WB) that automatically produces several analytical reports in a standardised way. It can use micro-level data from various types of surveys, such as Household Surveys, Demographic and Health Surveys (DHS) and Labor Force Surveys, to produce multiple sophisticated sets of tables and graphs for different categories of economic research.,,menuPK:7108381~pagePK:64168176~piPK:64168140~theSitePK:7108360,00.html


Book: User guide for ADePT version 5.0.

Find in:

Book: Unified Approach to Measuring Poverty and Inequality

Find in:



The Importance of Survey Design: How much can sampling affect the statistical quality of Household Surveys?

When we start preparing a household survey (HHS), the first step is to compile the sample frame. This step is normally omitted from most studies or taken for granted as an operational procedure, but the truth is that the design of a survey, and specifically the type of households it collects, affects the calculation of descriptive statistics such as means and standard errors.

In an ideal world, the perfect and simplest HHS would be based on an up-to-date list of all households in the population, the design would give each household on the list an equal probability of being selected, and all households asked to participate would actually do so. This would lead to a simple random sample that is typically representative of the whole population. In practice, household surveys collect data from a national list of households, commonly called the frame, which is often a census. Most frequently, sample sizes are built on a 1:500 sampling fraction, so that, for example, a sample of 10,000 corresponds to a population of 5 million. However, when constructing an HHS, experts come across several challenges that go beyond this ideal situation and demand alternative methodologies. Sometimes the data are bad: censuses are out of date or contain inconsistent information. In other situations the data are initially good but the survey design excludes groups of the population, leading to non-coverage errors. Sometimes good data deteriorate during the survey process: non-respondent households lead to missing values, and without proper data cleaning some good households can be lost in the statistical process because of transcription errors or implausible values. Above all, it is fundamental to understand that, whatever the theory says, sampling is hardly ever a pure exercise; it depends heavily on a prior analysis of the amount and quality of the data, which ultimately determines the methodologies used.
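As a rough illustration of the ideal case described above, the sketch below draws a simple random sample at a 1:500 fraction from a frame of 5 million households. The function name `simple_random_sample`, the `one_in` parameter and the use of consecutive integers as household IDs are assumptions made for the example, not part of any survey package.

```python
import random

def simple_random_sample(frame, one_in, seed=2030):
    """Draw a simple random sample without replacement.

    Every household in `frame` has the same chance, 1/one_in, of selection.
    """
    sample_size = len(frame) // one_in
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    return rng.sample(frame, sample_size)

# Hypothetical frame: household IDs 0 .. 4,999,999 (a population of 5 million).
frame = range(5_000_000)
sample = simple_random_sample(frame, one_in=500)

print(len(sample))  # 10000 households, i.e. a 1:500 sampling fraction
```

In a real survey the frame would be the census list of households, and the difficulties described above (outdated lists, non-coverage, non-response) are exactly what this idealised sketch leaves out.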

Non-Coverage and Non-respondent

The most important issue in sampling is to guarantee that the sample represents the population, but in reality this is hardly ever the case. Although it is used to evaluate poverty, the HHS by definition sometimes excludes relevant groups of the population that may be among the ultra or extreme poor, such as indigents, criminals or orphans. Surprisingly, non-coverage, i.e. the omission of certain groups from the original frame, is not a source of error confined to developing countries. If we think that bad data due to non-coverage is a myth in developed countries, we are surely wrong. For example, USA censuses are politically sensitive, so various interest groups can influence the count. Even when the frame is rigorous, the coverage of the population may bias results from the start: the homeless are automatically excluded because they do not fall within the category of households, and the same happens to people in institutional settings such as the military or prisons. Another interesting example comes from the UK Family Expenditure Survey, which regularly underestimates aggregate alcohol consumption by nearly half because of coverage errors that exclude people with high alcohol consumption.

Adding to these errors is the non-response effect. In some situations we do have an accurate and up-to-date frame but, despite the difficult task of gathering the right households, some of them do not respond. Despite the repeated complaints about bad-quality data in developing countries, this is less of a problem there than in the USA or Britain, where non-response is typically much larger. Non-respondent households reduce the quality of the survey because relevant households that were selected produce missing values instead, reducing the number of observations. While in developed countries non-response is associated with lack of time, privacy concerns or lack of engagement in the process, households in developing countries show almost complete cooperation, except, for example, when wealthy households are asked about their incomes and assets.

Clusters and Two Stage Sampling

Unfortunately, experience shows that censuses are often out of date or contain unreliable data. In other situations the complete data may be available, but covering the entire population is just too expensive. In both cases the selection of households is less straightforward and different frames need to be constructed. A common method in these cases is to use clusters, building the sample on a two-stage design. The first stage is to select clusters from a list of clusters; the second is to choose households within the selected clusters. In rural areas, for example, the clusters are often villages, from which the households are chosen directly. Practice shows that clusters allow better identification of households in the field when an up-to-date list is available. Most of the time clusters are drawn from the census, which typically has subunits that can be used for this first-stage sampling. Two-stage sampling is not inconsistent with each household having an equal chance of selection, provided that clusters are randomly selected with probability proportional to the number of households they contain and that the same number of households is selected from each cluster. The two-stage design has numerous advantages over simple random sampling. Not only does it allow better identification of households, it also takes into consideration a crucial issue in any survey: its cost. HHSs are very expensive, so finding the most cost-effective solution is a constant priority. By focusing on geographical groups instead of generating a sample randomly distributed over space, two-stage sampling is cheaper: instead of visiting households widely dispersed over the territory, the survey team can simply travel from village to village.
It also facilitates repeated visits to collect information from respondents who were not at home, to monitor the progress of record keeping, or just to ask supplementary questions. It also makes it possible to collect village-level information (community questionnaires) on public goods such as schools and clinics, or on crops. In other situations the purpose of the survey itself increases the need for clusters. Let's say that the survey's objective is to study the economic effect of AIDS. A random sample of the population would not produce many households with an infected person, so operational issues sometimes dictate that some groups are sampled more intensively than others, guaranteeing coverage of those groups and allowing statistical inference.
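The equal-probability property of two-stage sampling can be checked with a small calculation. The sketch below assumes that clusters are drawn with probability proportional to size (so a cluster's selection probability is the number of clusters drawn times its share of all households) and that a fixed number of households is taken in each selected cluster. The village names and sizes are invented for the illustration.

```python
from fractions import Fraction

def inclusion_probability(cluster_size, total_households, n_clusters, per_cluster):
    """Overall selection probability of one household in a two-stage design."""
    # Stage 1: cluster chosen with probability proportional to its size.
    p_cluster = Fraction(n_clusters * cluster_size, total_households)
    # Stage 2: a fixed number of households drawn inside the selected cluster.
    p_household = Fraction(per_cluster, cluster_size)
    return p_cluster * p_household

# Hypothetical frame of four villages (number of households in each).
cluster_sizes = {"village_A": 120, "village_B": 300, "village_C": 80, "village_D": 500}
total = sum(cluster_sizes.values())

probs = {
    name: inclusion_probability(size, total, n_clusters=2, per_cluster=20)
    for name, size in cluster_sizes.items()
}
for name, p in probs.items():
    print(name, p)
```

The cluster size cancels out between the two stages, so every household ends up with the same probability (here 2 × 20 / 1000), whichever village it lives in.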


If prior data analysis shows that subpopulations vary considerably within the population, it is advantageous to break a single survey into multiple independent strata (geographical area, ethnic group, level of living, etc.). This means dividing the members of the population into homogeneous subgroups before sampling. The strata should be mutually exclusive, i.e. every element in the population must be assigned to only one stratum. The strata should also be collectively exhaustive: no population element can be excluded. Simple random sampling is then applied within each stratum. This process effectively converts one survey into several independent surveys, guaranteeing in advance that there will be enough observations to permit estimates for each of the groups. It is not only cost-effective, it also improves the quality of statistical inference: when the means of the subgroups differ significantly, the variance of the stratified sample is smaller than that of the simple random sample. Only in the unlikely event that all stratum means coincide with the grand mean is there no gain in efficiency from stratification. A useful concept for assessing how the sample design affects precision is Kish's "design effect", often referred to as deff. It is the ratio of the variance of an estimate under the actual design (stratified or clustered) to the variance under simple random sampling. As we have seen, the stratified variance is lower than the simple random sampling variance, so for stratification this ratio is normally less than 1; for clusters, however, the ratio is larger than 1, which means that clustering increases errors. Clustering increases sampling variability compared with simple random sampling because households within a cluster are often similar to one another in their relevant characteristics.
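A minimal sketch of the design effect follows. For stratification it assumes proportional allocation and ignores the finite population correction, in which case deff reduces to the within-stratum share of the total variance. For clustering it uses the standard Kish approximation 1 + (m - 1)ρ, where m is the number of households per cluster and ρ the intra-cluster correlation; that formula is not given in the text above and is added here as an assumption. All numbers are invented.

```python
from statistics import pvariance

def stratified_deff(strata):
    """deff of a proportionally allocated stratified sample versus SRS."""
    all_values = [v for values in strata.values() for v in values]
    total_n = len(all_values)
    # Average within-stratum variance, weighted by stratum share.
    within = sum(len(v) / total_n * pvariance(v) for v in strata.values())
    return within / pvariance(all_values)

def cluster_deff(households_per_cluster, intra_cluster_corr):
    """Kish approximation for a clustered sample (assumed, not from the text)."""
    return 1 + (households_per_cluster - 1) * intra_cluster_corr

# Strata with very different means: stratification pays off (deff well below 1).
different = stratified_deff({"urban": [10, 12, 14], "rural": [2, 4, 6]})
# Strata with identical means: no gain from stratification (deff equal to 1).
same = stratified_deff({"north": [1, 2, 3], "south": [1, 2, 3]})
# Similar households within clusters push deff above 1.
clustered = cluster_deff(households_per_cluster=10, intra_cluster_corr=0.1)

print(round(different, 3), round(same, 3), round(clustered, 3))
```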

Weighting and Inflation factors

In most surveys different households have different probabilities of being selected into the sample. As we have seen, the purpose of the survey may dictate that some types of households are over-represented relative to others. This can happen either deliberately, through the sample design, or accidentally, for example when a considerable part of the sample does not respond to the questionnaire. In both cases unweighted sample means will be biased estimators of population means. To undo this bias, the sample data are reweighted to make them representative of the population. These weights are also referred to as inflation factors, and they are calculated as inversely proportional to each household's probability of being selected. Each observation is then multiplied by its weight when means and standard errors are computed, compensating for the differences induced either by the sample design or by non-respondent households.
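The reweighting step can be sketched in a few lines. The incomes and selection probabilities below are invented: two better-off households are assumed to have been sampled with probability 0.1 and one poor household with probability 0.01.

```python
def weighted_mean(values, selection_probs):
    """Mean in which each observation is inflated by 1 / P(selection)."""
    weights = [1 / p for p in selection_probs]  # inflation factors
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

incomes = [100, 100, 10]          # hypothetical household incomes
probs = [0.1, 0.1, 0.01]          # hypothetical selection probabilities

unweighted = sum(incomes) / len(incomes)
weighted = weighted_mean(incomes, probs)

print(unweighted, weighted)
```

The poor household stands in for 100 households in the population and each better-off one for 10, so the weighted mean (25) is far below the unweighted sample mean (70).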


When we calculate descriptive statistics from survey data, we need to guarantee that these statistics describe the population rather than the particular sample used for the analysis. To achieve this it is essential to know how the sample was designed and what the relationship is between the sample and the population, because different sample designs require the data to be processed in different ways for the estimates to be comparable. To make inferences we need a framework for thinking about how the data were generated and, above all, how data collection induces randomness into the sample. However, sampling should not become an obsession. It should be recognised that every rational design involves some sampling error, and that sampling errors are most often substantially smaller than non-sampling errors. Common sense is always the best guide: keep it simple and avoid over-correcting complex designs, because excessive rigour can become counterproductive.



How much can data collection influence the analysis of household surveys?

Data collection is often the most time-consuming part of quantitative methods. Proper handling of data and good data management skills can save researchers not days but months of work. If you collect, clean and merge your data correctly, you will have a much easier time with your data analysis. After the sample frame has been chosen and the questionnaire constructed, the household survey starts to materialise during data collection. In this phase team leaders plan the data collection process and assign households and specific districts to individual interviewers. Usually performed by the National Statistics Offices, data collection is a complex operational process that involves the logistics and management of enumerators who travel to the districts to conduct the interviews and enter the data. HHS data management mainly consists of entering the data retrieved from field interviews, checking them for errors, and reviewing and cleaning them for final dissemination.

In data cleaning the most common errors are consistency and completion errors. Consistency errors occur when the value entered is not in line with the unit of measure or was entered wrongly. Another big issue is missing values, which correspond to observations that were skipped and have no value associated with them. Here are some examples of reported errors:


  • Case Data Consistency Reports

*** Case [0043001] has 14 messages (4 E / 0 W / 10U)

E 88182 Inconsistent field detected… HH_C22J(3) is not a skipped field, however is NotAppl

U 800001 ERROR! HH_G03B(2): Invalid Item Unit Combination. The unit (05C) is not valid for ITEM (HH_G02 = 101: Item Code). Please correct!


  • Case Completion Reports

*** Case [0033] has 1 messages (0 E / 0 W / 1U)

U   -69 EA – (10101063) : THE HOUSEHOLD (33) IS NOT HAVING ANY T0 OR T1!!!

*** Case [0035] has 1 messages (0 E / 0 W / 1U)

U   -69 EA – (10101063) : THE HOUSEHOLD (35) IS NOT HAVING ANY T0 OR T1!!!
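The kind of rule that produces such reports can be sketched as follows. This is only an illustration: the field names, the table of valid item/unit combinations and the message wording are hypothetical and are not the checks of any real survey software.

```python
# Hypothetical lookup: item code -> unit codes accepted for that item.
VALID_UNITS = {101: {"01", "02"}}

REQUIRED_FIELDS = ("household_id", "item_code", "unit", "quantity")

def check_household(record):
    """Return a list of completion and consistency messages for one record."""
    messages = []
    # Completion check: every required field must have a value.
    for field in REQUIRED_FIELDS:
        if record.get(field) is None:
            messages.append(f"missing value: {field}")
    # Consistency check: the unit must be valid for the reported item.
    item, unit = record.get("item_code"), record.get("unit")
    if item is not None and unit is not None:
        if unit not in VALID_UNITS.get(item, set()):
            messages.append(f"invalid unit {unit} for item {item}")
    return messages

bad_unit = check_household(
    {"household_id": 43, "item_code": 101, "unit": "05C", "quantity": 2}
)
incomplete = check_household(
    {"household_id": 33, "item_code": 101, "unit": "01", "quantity": None}
)
clean = check_household(
    {"household_id": 35, "item_code": 101, "unit": "02", "quantity": 1}
)
print(bad_unit, incomplete, clean)
```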


In LSMS surveys, households are visited twice in each round, two weeks apart, with roughly half the questionnaire administered at each visit. Between the two visits, the first data are entered into the computers and are automatically subjected to editing and consistency checks by the software. Such procedures not only minimise data entry errors but also allow the enumerators to correct some errors during the second visit. Today most error reporting relies on computer software that identifies the errors automatically, maximising the benefits of technical tools across all survey areas.

For decades questionnaires and data entry were processed manually, but today's standard data management avoids manual handling. Although it was common practice, the truth is that pasting individual data cells from one source to another, or typing data by hand, is very prone to human error and typically results in erroneous data. From a methodological perspective this is the greatest possible disaster, because no statistical theory or method can repair it, and it leads to a study that is systematically wrong. The good news is that most data management systems are now based on a wide range of computer software. The Computer Assisted Personal Interview (CAPI) software developed by the WB allows tablets to be used as a platform for data entry in HHS interviews, replacing the paper-based interview, which is prone to manual entry errors. It also reduces the number of coding errors, thanks to validation mechanisms that make it impossible to enter values outside a given range and to automated routing that reduces the incidence of missing data. It skips steps and improves management accountability, as supervisors can view and check the collected information as soon as the enumerators finish the interviews. It is also more flexible: changes in the structure of the questionnaire can be instantly reflected on the interviewers' devices, and the dynamic structure allows questions to vary depending on the answers given by the respondent.

There is also other software, such as the Census and Survey Processing System (CSPro) developed by the US Census Bureau, which translates questionnaires into data dictionaries that are then used to build the data entry applications into which the survey results are entered. It also has the advantage of delivering automatic error reporting. When data are sent to headquarters for first-line processing and feedback is sent back for mid-stream adjustments, new technology options for data transfer, such as USB, Internet FTP/VPN servers and web-based automatic synchronisation, are now available and are improving speed and accuracy while ensuring confidentiality.

Technology helps prevent errors, but one of the guiding principles of any data management is to give priority to preventing errors rather than treating them. This generally means a large investment in training before the field work starts. Training human resources is often what makes an HHS so expensive, but there is always a trade-off between accurate data and cheap data. The best way to prevent errors is to ensure that all survey staff and management are properly trained. Together with a good and clear administration of the data collection process, this significantly improves data quality, reducing the need for data cleaning and benefiting field management and supervision.

Although it is often forgotten and seen essentially as a very procedural stage, data collection crucially determines the data quality of an HH survey. In fact, experience shows that a robust data management system is the most cost-effective investment in data collection for meeting the current demands for quality, quantity and availability of data in the context of the post-2015 agenda. Below are some links to tools that can help in data collection:


CAPI: Computer Assisted Personal Interviewing

CAPI is an innovation over Paper-based Personal Interviewing (PAPI): it provides a tablet-based software platform that supports several management functions of the data collection process, such as management of survey personnel (enumerators), allocation of assignments, data entry, validation, and data transmission and synchronisation.,,contentMDK:23426734~pagePK:64168182~piPK:64168060~theSitePK:8213597,00.html,





Data Analysis: How to deal with Outliers and Unusual Observations in HHS?

Once the data collection and data cleaning processes have been concluded, researchers start what is called the model checking process. One of the first analyses in regression diagnostics is to check for outliers. It is commonly done at this early stage of research because it is a transitional step from data collection: some abnormal observations may still be the result of incorrect data entry or bad measurement, and these can still be checked in the final stage of data cleaning.

But what is an outlier? Outliers are observations whose values deviate largely from most of the sample or, in other words, that have large residuals. This is a problem mainly in least squares, because these extreme values of the observed variables can distort the estimates of the regression coefficients. By seriously influencing most parametric statistics, such as means, standard deviations and correlations, they reduce the accuracy of statistical inference. What should we do in this case, and what is the best treatment? Although this is one of those statistical issues that everyone knows about, most people aren't sure how to deal with it. The best solution depends on the specific situation, so the first question to ask is whether the observation is unusual because of some peculiar feature of the sample or because of a data entry error. Most commonly outliers reflect coding errors in the data, for example when the decimal point is misplaced or when we have failed to declare some values as missing. If it is obvious that the outlier is due to incorrectly entered or measured data, you can either correct it (if you still can) or otherwise decide to drop the observation (with proper justification). That is why this diagnosis should be an integral part of the final data cleaning process, allowing the types of outliers to be identified. In other cases outliers may result from model misspecification, such as omitted variables or the need for another functional form. For example, the log form tends to smooth out the effect of outliers, so they may simply suggest that we need to reassess the functional form or choose a different one. An outlier may also belong to a population different from the one we want to study, in which case we can drop the observation (upon justification).
Small samples are also especially vulnerable to outliers (there are fewer cases to counter the outlier), so enlarging the sample can dilute their influence.

After identifying the obvious situations in which the outlier can be omitted, we are still left with others that demand more careful attention. The main dilemma when we face a true outlier is always: to drop or not to drop? Blindly dropping legitimate observations, which are sometimes the most interesting ones and are clearly influential, is not technically reasonable and will indiscriminately amputate precious information from the model. Above all, it is crucial to investigate the nature of the outlier before deciding. In fact, a precise analysis of the type of outlier, within the range of unusual and influential observations, is what should determine the treatment. Let's look below at different situations to understand in which cases outliers affect the regression the most.

[Figure: four scatter plots of outlier cases]

There are two ways in which abnormal observations can influence a regression: discrepancy and leverage. Discrepancy relates to extreme observations in the y direction or, in other words, observations that have a large residual (the typical outlier). Leverage relates to observations that are far away from the average predictor values; it measures how far an independent variable deviates from its mean (extremes in the x direction).

It is the combination of these two phenomena, called influence in statistics, that tells us how much a single observation can affect the results of the regression analysis. One intuitive way of identifying these effects is to plot the regression and look at the observations. Looking at the graphs above, the first graph in the top left corner has no leverage and no outlier, so there are no influential observations and nothing that needs to be studied or dropped. In the top right there is low leverage but one outlier: the residuals will be slightly higher, but the slope coefficient will barely change. The intercept, t-values and other test statistics will differ, but the overall results will not. In this case the outlier does not change the results, although it affects the assumptions, so this type of outlier may more easily be dropped. In the bottom left there is no outlier despite the high leverage, so this extreme observation is harmless. If deemed necessary it can easily be dropped, because the rule is that you may omit an observation if neither its presence nor its absence would change the regression line.

More commonly, we see observations like the one in the bottom right graph, which is an outlier with high leverage. In this situation it is not legitimate simply to drop the outlier, because it affects both the results and the assumptions. This is the case that influences the regression line most dramatically, because it seriously affects not only the intercept and test statistics but also the slope coefficients. So we may conclude that outliers with low leverage can more easily be dropped (with justification) because they have less impact on the results, whereas dropping an outlier with high leverage will dramatically change the regression line. In other words, the higher the leverage, the more influence the outlier has on the regression. Even an observation with a large distance will not have much influence if its leverage is low. It is the combination of an observation's leverage and distance that determines its influence.

The study of influence, which is the combination of discrepancy and leverage, is therefore a way of measuring how much an outlier will change the parameter estimates (slope coefficients and summary statistics), and it can help decide on the most appropriate treatment, particularly whether to drop the observation.

Outlier Diagnosis: The Tools.

There are several tools to detect outliers in Stata. We can start with very simple commands, based on basic descriptive statistics, that help detect outliers early.

Descriptive Techniques:

For example:

. use, clear

. reg var1 var2 var3

*Basic Descriptive stats*

. sum var1 var2 var3

. extremes var1 var2 var3

*Graphic Techniques*

. scatter var1 var2 var3

One of the most helpful diagnostic graphs is the leverage-versus-squared-residual plot:

. gen id = _n

. lvr2plot, mlabel(id)

*Residual Statistics*

The distance of an observation is based on the error of prediction for that observation: the greater the error of prediction, the greater the distance. There are different types of residuals, usually called discrepancy measures. We can calculate standardized residuals (values more extreme than 3 in absolute value may be a problem), but the most commonly used measure for this purpose is the studentized residual. The studentized residual for an observation is closely related to the error of prediction for that observation divided by the standard deviation of the errors of prediction. However, the predicted score is derived from a regression equation in which the observation in question is not counted. This makes it preferable to the standardized residual for identifying outliers (values of 3 or more, or of -3 or less, may be problematic).

. predict stdresid, rstandard

. predict rstudent, rstudent
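To make the difference between the two residuals concrete, here is a minimal Python sketch for a regression with a single predictor. It assumes the usual textbook formulas: the standardized residual divides by the overall residual standard deviation, while the studentized residual uses the standard deviation re-estimated without the observation in question. The data are invented, with one deliberate outlier at the fifth observation.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - slope * mean_x, slope

def residual_diagnostics(xs, ys):
    """Return (standardized, studentized) residuals for a one-predictor fit."""
    n = len(xs)
    intercept, slope = fit_line(xs, ys)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    lev = [1 / n + (x - mean_x) ** 2 / sxx for x in xs]
    sse = sum(e ** 2 for e in resid)
    s = math.sqrt(sse / (n - 2))
    standardized, studentized = [], []
    for e, h in zip(resid, lev):
        standardized.append(e / (s * math.sqrt(1 - h)))
        # Residual standard deviation with this observation left out.
        s_deleted = math.sqrt((sse - e ** 2 / (1 - h)) / (n - 3))
        studentized.append(e / (s_deleted * math.sqrt(1 - h)))
    return standardized, studentized

xs = list(range(1, 11))
ys = [2 * x + 0.5 * (-1) ** x for x in xs]  # a line plus small noise
ys[4] += 10                                 # hypothetical outlier at x = 5

r, t = residual_diagnostics(xs, ys)
print(round(r[4], 2), round(t[4], 2))
```

With only ten observations the standardized residual of the outlier stays below 3, because the outlier inflates the very standard deviation it is divided by, while the studentized residual flags it clearly.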

These statistics measure discrepancy, i.e. the difference between the predicted Y and the observed Y. But some outliers will have relatively little influence on the regression line. An extreme value of Y that is paired with an average value of X will have less effect than an extreme value of Y that is paired with a non-average value of X. An observation with an extreme value on a predictor variable (or with extreme values on multiple Xs) is called a point with high leverage. The leverage of an observation is based on how much the observation's value on the predictor variable differs from the mean of the predictor variable. The greater an observation's leverage, the more potential it has to be an influential observation. For example, an observation with a value equal to the mean of the predictor variable has no influence on the slope of the regression line, regardless of its value on the criterion variable. On the other hand, an observation that is extreme on the predictor variable has the potential to affect the slope greatly.

Some residual statistics therefore measure leverage:

. predict leverage, leverage

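Leverage is bounded by two limits: 1/n and 1. Leverage is considered high when it exceeds 2k/n, where k is the number of estimated coefficients (including the constant) and n is the number of observations.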
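For a regression with a single predictor, leverage can be computed by hand, as in the sketch below. It assumes that k counts the constant plus the one predictor (k = 2); the x values are invented, with one observation far from the rest.

```python
def leverage(xs):
    """Leverage of each observation in a one-predictor regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return [1 / n + (x - mean_x) ** 2 / sxx for x in xs]

xs = [1, 2, 3, 4, 5, 20]      # hypothetical predictor values
h = leverage(xs)

k = 2                         # constant + one predictor (assumption)
threshold = 2 * k / len(xs)
flagged = [i for i, value in enumerate(h) if value > threshold]

print([round(value, 3) for value in h], flagged)
```

Every value lies between 1/n and 1, the values sum to k, and only the last observation (x = 20) exceeds the 2k/n threshold.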

Influence TESTS:

There are several tests that identify the combination of leverage and discrepancy. The influence of an observation can be thought of in terms of how much the predicted scores for other observations would differ if the observation in question were not included. Cook's D is one measure of the influence of an observation: it is proportional to the sum of the squared differences between the predictions made with all observations in the analysis and the predictions made leaving out the observation in question. If the predictions are the same with or without the observation, then it has no influence on the regression model. If the predictions differ greatly when the observation is not included in the analysis, then the observation is influential.

Cook's distance TEST: According to the Stata 12 Manual, Cook's distance measures the aggregate change in the estimated coefficients when each observation is left out of the estimation. Values of Cook's distance greater than 4/N may be problematic. Another common rule of thumb is that an observation with a Cook's D over 1.0 has too much influence. As with all rules of thumb, these should be applied judiciously and not thoughtlessly.

. predict cooksd, cooksd

. sum
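As a cross-check on what Cook's distance measures, the following sketch computes it by hand for a one-predictor regression using the textbook formula based on residuals and leverage. The data are invented: the last observation is far from the others in x and well off the line they suggest.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - slope * mean_x, slope

def cooks_distance(xs, ys):
    """Cook's D for each observation (p = 2 coefficients: constant and slope)."""
    n, p = len(xs), 2
    intercept, slope = fit_line(xs, ys)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    s2 = sum(e ** 2 for e in resid) / (n - p)
    lev = [1 / n + (x - mean_x) ** 2 / sxx for x in xs]
    return [e ** 2 * h / (p * s2 * (1 - h) ** 2) for e, h in zip(resid, lev)]

xs = [1, 2, 3, 4, 5, 20]                 # hypothetical data
ys = [1.1, 1.9, 3.2, 3.8, 5.1, 8.0]

d = cooks_distance(xs, ys)
flagged = [i for i, value in enumerate(d) if value > 4 / len(xs)]

print([round(value, 2) for value in d], flagged)
```

Only the last observation exceeds both the 4/N and the 1.0 rules of thumb, even though its residual is small: its high leverage drags the fitted line towards it.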

DFBETA TEST: It focuses on one coefficient and measures the difference between the regression coefficient when the ith observation is included and when it is excluded, the difference being scaled by the estimated standard error of the coefficient. Observations with |DFBETA| > 2/sqrt(N) deserve special attention, but it is also common practice to use |DFBETA| > 1, meaning that the observation shifted the estimate by at least one standard error.

. dfbeta
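The idea behind DFBETA can be reproduced by refitting the regression with each observation left out in turn. The sketch below does this for the slope of a one-predictor regression, scaling the change by the residual standard deviation of the reduced fit divided by the square root of the sum of squared deviations of x. This scaling follows the usual textbook definition and is stated here as an assumption; the data are invented.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - slope * mean_x, slope

def dfbeta_slope(xs, ys):
    """Scaled change in the slope when each observation is left out."""
    n = len(xs)
    _, slope = fit_line(xs, ys)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    result = []
    for i in range(n):
        xs_i, ys_i = xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]
        a_i, b_i = fit_line(xs_i, ys_i)
        sse_i = sum((y - a_i - b_i * x) ** 2 for x, y in zip(xs_i, ys_i))
        s_i = math.sqrt(sse_i / (n - 1 - 2))  # residual SD without observation i
        result.append((slope - b_i) / (s_i / math.sqrt(sxx)))
    return result

xs = [1, 2, 3, 4, 5, 20]                 # hypothetical data
ys = [1.1, 1.9, 3.2, 3.8, 5.1, 8.0]

dfb = dfbeta_slope(xs, ys)
flagged = [i for i, value in enumerate(dfb) if abs(value) > 2 / math.sqrt(len(xs))]

print([round(value, 2) for value in dfb], flagged)
```

The last observation has by far the largest DFBETA in absolute value, and its negative sign shows that including it pulls the slope down.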

ROBUST regression and RREG command

In the recurrent dilemma of whether or not to drop outliers, there are two further statistical tools that can be used: regression with robust standard errors and the rreg command. Both are available for least squares estimation, but they correspond to two different methodologies. Robust regression in the sense of rreg might be a good strategy if you have no compelling reason to exclude outliers from the analysis, since it is a compromise between excluding these points entirely and including all the data points and treating them all equally, as OLS does. It weights observations differently based on how well behaved they are, so, roughly speaking, it is a form of weighted and reweighted least squares regression. By contrast, the robust option of regress mainly tackles heteroskedasticity: it provides robust (Huber-White sandwich) standard errors, which are more honest when the error variance differs along the distribution. It should be noted that this option adjusts only the standard errors. The coefficient estimates are the same as in OLS, so outliers and long tails in any of the variables affect them exactly as before.

Stata's rreg command is the most appropriate tool for tackling outliers. And it has a plus: if you want to avoid the dilemma of whether or not to drop an observation, this is your tool, because rreg will decide for you. It first runs the OLS regression, gets Cook's D for each observation, and then drops any observation with a Cook's distance greater than 1. Then an iteration process begins in which weights are calculated based on absolute residuals. The iteration stops when the maximum change in the weights from one iteration to the next is below tolerance. Two types of weights are used. In Huber weighting, observations with small residuals get a weight of 1, and the larger the residual, the smaller the weight. With biweighting, all cases with a non-zero residual get down-weighted at least a little. Two different kinds of weights are used because Huber weights can have difficulties with severe outliers, while biweights can have difficulty converging or may yield multiple solutions. In short, the most influential points are dropped, and then cases with large absolute residuals are down-weighted.
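The two weighting schemes can be sketched as simple functions of the scaled residual. This illustrates the weight functions only, not the full iterative procedure. The tuning constants 1.345 and 4.685 and the scaling of residuals by the median absolute deviation divided by 0.6745 are commonly quoted defaults and are assumptions here; the residuals are invented.

```python
from statistics import median

def scale_residuals(residuals):
    """Divide residuals by a robust scale estimate (MAD / 0.6745)."""
    centre = median(residuals)
    mad = median([abs(r - centre) for r in residuals])
    return [r / (mad / 0.6745) for r in residuals]

def huber_weight(u, c=1.345):
    """Weight 1 for small scaled residuals, shrinking as |u| grows."""
    return 1.0 if abs(u) <= c else c / abs(u)

def biweight(u, c=4.685):
    """Every non-zero residual is down-weighted; beyond c the weight is 0."""
    return (1 - (u / c) ** 2) ** 2 if abs(u) < c else 0.0

residuals = [-1.0, 0.5, 0.2, -0.3, 8.0]   # hypothetical OLS residuals
scaled = scale_residuals(residuals)

for r, u in zip(residuals, scaled):
    print(r, round(huber_weight(u), 3), round(biweight(u), 3))
```

The severe outlier (residual 8.0) keeps a small positive Huber weight but receives a biweight of zero, which is why the two schemes complement each other.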

Treating Outliers

When the outlier has a large influence, one option is to try to reduce its leverage. There are several techniques that can be used: 1) symmetric trimming (dropping values), which maintains the median, 2) replacing outliers with the median, or 3) Winsorising the top and bottom observations.

  • Trimming Technique. In this technique the main question is: where to trim? The difficulty is choosing the appropriate criterion. We can use +/- 3 SDs, which assumes normally distributed data, but economic variables are typically not normal. We can also use bounds from the physical or biological sciences, existing standard conventions, or pure common sense to define the limits.
  • Replacing with the median is another option, and its effect is larger the greater the difference between the mean and the median. In symmetric distributions (Mean ≈ Median) the results will be approximately the same. In distributions skewed to the left (Mean < Median), replacing with the median will increase the mean, while in distributions skewed to the right (Mean > Median) it will decrease the mean.
  • Winsorising is the transformation of statistics by limiting extreme values. While in a trimmed estimator the extreme values are discarded, in a Winsorised estimator the extreme values are instead replaced by certain percentiles (the trimmed minimum and maximum).

Let's show the difference between the two techniques through the following example: a 90% Winsorisation compared with a 90% trimming.

Identifying extreme values N=20

{1, 3, 9, 12, 14, 18, 24, 25, 40, 57, 77, 83, 84, 87, 88, 90, 91, 100, 102, 152}

Winsorising N=20

{3, 3, 9, 12, 14, 18, 24, 25, 40, 57, 77, 83, 84, 87, 88, 90, 91, 100, 102, 102}

Trimming N=18

{    3, 9, 12, 14, 18, 24, 25, 40, 57, 77, 83, 84, 87, 88, 90, 91, 100, 102       }
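The example above can be reproduced with a short sketch. It assumes that "90%" means treating the lowest 5% and the highest 5% of the observations, which for N=20 is one value in each tail; the function names and the `tail_percent` parameter are chosen for the illustration.

```python
def winsorise(values, tail_percent=5):
    """Replace the lowest and highest tail_percent of values by their neighbours."""
    s = sorted(values)
    n = len(s)
    k = n * tail_percent // 100          # observations affected in each tail
    if k == 0:
        return s
    return [s[k]] * k + s[k:n - k] + [s[n - k - 1]] * k

def trim(values, tail_percent=5):
    """Drop the lowest and highest tail_percent of values."""
    s = sorted(values)
    k = len(s) * tail_percent // 100
    return s[k:len(s) - k]

data = [1, 3, 9, 12, 14, 18, 24, 25, 40, 57,
        77, 83, 84, 87, 88, 90, 91, 100, 102, 152]

winsorised = winsorise(data)
trimmed = trim(data)

print(len(winsorised), winsorised)
print(len(trimmed), trimmed)
```

Winsorising keeps all 20 observations but pulls 1 up to 3 and 152 down to 102, while trimming simply discards those two values and leaves 18 observations.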

